Basic Concepts Reference Manual: A gentle overview

Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University

Table of Contents
1. Introduction
   Statistical Packages
   The WidgeOne Dataset
2. Data Analysis and Statistical Concepts
   Concept 1 – Measurements of Central Tendency
   Concept 2 – Measurements of Dispersion
   Concept 3 – Visualization of Univariate Data
   Concept 4 – Visualization of Multivariate Data
   Concept 5 – Random Number Generation and Simple Sampling
   Concept 6 – Confidence Intervals

These reference manuals have been developed to assist students with the basics of statistical computing – a sort of "Statistical Computing for Dummies". It is not our intention to use this manual to teach statistical concepts, but rather to demonstrate how to apply previously taught statistical and data analysis concepts the way that professionals and practitioners apply them – with the able assistance of computing. Proficiency in software allows students to focus more on the interpretation of the output and on the application of results rather than on the mathematical computations. We should pause here and strongly make the point that computers should serve as a medium of expediency of calculation – not as a substitute for the ability to execute a calculation.

In the Basic Concepts manual, we present statistical concepts, context for their use, and formulas where appropriate, and we provide exercises to execute these concepts by hand. Then, in each subsequent manual, the concepts are applied in a consistent manner using each of the five major statistical computing packages – Excel, SPSS, Minitab, R, and SAS.

Readers of this manual are assumed to have completed an introductory statistics course. For individuals wishing to review statistical concepts, we recommend Intro Stats by De Veaux, Velleman, and Bock.

Statistical Packages Used in this Manual

We have chosen to incorporate the five most widely used statistical computing packages in these manuals – Excel, SPSS, Minitab, SAS, and R. While each of these packages can be used for basic data analysis, each has its own specializations. Any individual who can represent themselves as knowledgeable and proficient in some or all of these packages will possess a marketable and differentiating skill set.

Excel

This spreadsheet software package is ubiquitous, and it represents a very basic and efficient way to organize, analyze, and present data. Employers today expect that, at a minimum, new hires with college degrees will have a working knowledge of Excel. Excel is used anywhere that data is available – which is everywhere: offices, libraries, schools, universities, home offices, and everywhere in between. In addition to its role as a data analysis package, Excel is often used as a starting point to capture and organize data before importing it into a more sophisticated analysis package such as SPSS, Minitab, or SAS. And, after analysis is complete, datasets can be exported back to Excel and shared with others who may not have access to (or the ability to use) the other analysis packages (we gently refer to this group as the "great statistical unwashed").
For product information regarding Excel, please visit: http://office.microsoft.com/en-us

SPSS

The "Statistical Package for the Social Sciences", or SPSS, is one of the most heavily used statistical computing packages in industry. SPSS has over 250,000 customers in 60 countries and is particularly heavily used in medicine, psychology, marketing, political science, and the other social sciences. Because of its "point and click" orientation, SPSS has become one of the preferred packages of non-statisticians.

For product information regarding SPSS, please visit: http://www.spss.com/

Minitab

Minitab was developed in 1972 by statistics professors at Penn State University (where it is still headquartered) who were looking for a better way to teach undergraduate statistics in the classroom. From this starting point, Minitab is now used at over 4,000 universities in 80 countries and by hundreds of companies ranging from the Fortune 500 to startups. Of the main statistical computing packages, Minitab has the strongest graphics and visualization capabilities. The package is most heavily used in Six Sigma and other quality design initiatives, and Minitab's customer list includes a large number of manufacturing and product design firms such as Ford, GE, GM, HP, and Whirlpool.

For product information regarding Minitab, please visit: http://www.minitab.com/

SAS

"Statistical Analysis Software", or SAS, is typically considered to be the most complete statistical analysis package on the market (professional tip – please pronounce this as "sass"; if you pronounce the package as "S-A-S", people will think you are a poser). This is the package of choice of most applied statisticians. Although the most recent version of SAS (version 9) includes some point-and-click options, SAS uses a scripting language to tell the computer what data manipulations and computations to perform. We will demonstrate how to actually write the code for SAS rather than defaulting to the point-and-click functionality in v.9, SAS Enterprise Guide, SAS Enterprise Miner, and other more user-friendly GUI SAS products. Our rationale here is this – if you learn to drive a manual transmission, you can drive anything. Similarly, if you can program in Base SAS, you can use (and understand) just about any statistical analysis package. The learning curve for SAS is longer and steeper than for the other packages, but the package is considered the benchmark for statistical computing. SAS is used in 110 countries, at 2,200 universities, and at 96 of the Fortune 100 companies.

For product information regarding SAS, please visit: http://www.sas.com/

R

R is a command-driven programming environment for executing statistical analysis. Unlike all of the other software packages we have discussed, which are proprietary, R is an open-source program that is free and readily available for download from the internet. R is becoming quite popular for quantitative analysis in many fields, including statistics, social science research (psychology, sociology, education, etc.), marketing research, and business intelligence. R is an implementation of the S programming language, which was originally developed at Bell Labs in the 1970s (S-Plus is a commercial implementation of the same language).
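As a small taste of R's command-driven style, here is a minimal session; the values below are made up purely for illustration:

    # Compute summary statistics for a small, made-up vector of values.
    tenure <- c(4.5, 9.0, 12.5, 7.0, 10.5)   # illustrative values only
    mean(tenure)      # arithmetic mean
    median(tenure)    # median
    summary(tenure)   # min, quartiles, median, mean, and max in one call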
For product information regarding R, please visit: http://cran.r-project.org/

Organization of the Manuals

After a brief review of the most common – and, we believe, essential – statistical/data analysis concepts that every college-educated person, regardless of discipline, should know, we will explain how each of these concepts is executed in Excel (2010), SPSS (v.18), Minitab (v.16), SAS (v.9.2), and R. We have taken a software-oriented approach rather than a statistical concept-oriented approach, because it is the software application, rather than the statistical concepts, that represents the focus of this document. For example, our first concept is descriptive statistics. Rather than explaining descriptive statistics in each package and then moving to the second analysis concept, we focus on all of the concepts in Excel, then move to a focus on all of the concepts in SPSS, and so on. Yes, we understand that from the reader's perspective this may be a bit monotonous. After you finish your Ph.D. in Statistics, you can write your manual your way.

Throughout each manual, we have used screenshots from the various packages and have developed easy-to-follow examples using a common dataset. At the end of each manual, we have included a section titled "Lagniappe". This word derives from New World Spanish la ñapa, "the gift". The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a purchase. Our lagniappe for our readers includes the extra and interesting things that we have learned to do with each of these software programs that might not be easily found or well known. A little extra information at no extra cost!

Overview of Dataset

Throughout these manuals, we will use a common dataset taken from a small manufacturing company – the WidgeOne company.
The WidgeOne dataset:
- An Excel file – WidgeOne.xls
- Both qualitative and quantitative variables – 23 variables total
- Three sheets in one workbook:
  o Plant_Survey
  o Employees
  o Attendance
- 40 observations

VARIABLE     MEANING                               VARIABLE TYPE   SHEET
EMPID        Employee ID                           Qualitative     ALL
PLANT        Plant ID                              Qualitative     Plant_Survey
GENDER       Gender                                Qualitative     Plant_Survey
POSITION     Job Type                              Qualitative     Plant_Survey
JOBSAT       Job Satisfaction (1-10)               Quantitative    Plant_Survey
YRONJOB      Years in current job                  Quantitative    Plant_Survey
JOBGRADE     Job Level (1-10)                      Quantitative    Plant_Survey
SOCREL       HR Social Relationship Score (0-10)   Quantitative    Plant_Survey
PRDCTY       HR Productivity Rating (out of 100)   Quantitative    Plant_Survey
Last Name    Employee Last Name                    Qualitative     Employees
First Name   Employee First Name                   Qualitative     Employees
JAN…         Attendance in January (%)             Quantitative    Attendance

[Figure: screenshot of WidgeOne.xls]

Data Analysis and Statistical Concepts

As former practitioners who used statistics on an almost daily basis in our professions in finance, marketing, engineering, manufacturing, and medicine, we have developed our "TOP 6" list of the most common and most useful applications of statistics and data analysis. After a brief explanation of each concept, examples will be provided for how to execute these concepts by hand (with a calculator). We cannot emphasize strongly enough that the calculation of the concepts needs to be mastered and fully understood before it can be effectively "outsourced" to a software application.

Types of variables

There are two distinct types of variables: quantitative and qualitative. Quantitative variables measure how much of something (the quantity) a unit possesses. For example, in the WidgeOne dataset, the quantitative variable YRONJOB measures how many years each employee has spent in their current job. Quantitative variables are also known as continuous variables. Qualitative variables identify whether an observation belongs to a group. In the WidgeOne dataset, GENDER is a qualitative variable – it identifies each employee as male or female. Qualitative variables can certainly have number values – such as 0 for male and 1 for female – but these numbers are still group labels and absolutely cannot be treated as quantitative values. If an employee has a 1, it indicates that the employee is female – it does not mean that the employee has more gender than someone with a 0. Qualitative variables are also known as categorical variables.

There are two types of qualitative variables: nominal and ordinal. As the name implies, the values of nominal variables carry information about the name of the group they belong to – such as gender and plant. A special case of nominal variables is the identifier variable. Identifiers (you guessed it) serve as a way to identify each observation and carry no other useful information; for purposes of analysis, they are treated as neither quantitative nor qualitative. Ordinal variables, also as the name implies, have a natural inherent order that reflects having more or less of something. Ordinal values look like "a little", "some", "a lot" or "small", "medium", "large".
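These distinctions matter in software, too. The sketch below shows how they might be declared in R; the data frame and the ordinal SIZE variable are invented for illustration and are not part of WidgeOne.xls:

    # Declaring variable types in R (invented illustration data):
    emp <- data.frame(
      EMPID   = c("001", "002", "003", "004"),          # identifier
      GENDER  = factor(c("M", "F", "F", "M")),          # qualitative, nominal
      SIZE    = factor(c("small", "large", "medium", "small"),
                       levels  = c("small", "medium", "large"),
                       ordered = TRUE),                 # qualitative, ordinal
      YRONJOB = c(3.5, 12.0, 8.2, 9.7)                  # quantitative
    )
    str(emp)   # confirms how R stores each column (factor vs. numeric)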
Things start to get a little fuzzy here. An ordinal variable can sometimes be treated as quantitative (measuring the quantity), but only if we know "how much more" each category represents than the one preceding it.

Concept 1: Measurements of Central Tendency

The most common application of statistics is the measurement of the central tendency of a dataset, of which there are three. "Central tendency" is a geeky way of answering the question – "What is the most representative value?" The mean, median, and mode are all measures of central tendency – all measures of the average. If you are reporting or discussing a value as a mean, label it as such; do not use the words "mean" and "average" interchangeably.

The mean is the first and most popular measurement of central tendency because:
- It is familiar to most people;
- It reflects the inclusion of every item in the dataset;
- It always exists;
- It is unique;
- It is easily used with other statistical measurements.

The formula for the calculation of a mean is:

$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$

where $X_i$ = each observation in the dataset and $N$ = the number of observations in the dataset. We know how everyone LOVES formulas with Greek letters!

FUN MANUAL CALCULATION!! Using the WidgeOne.xls dataset, calculate the mean years that men in the Norcross plant (n=10) have been in their current job (YRONJOB). The answer is on the next page…don't cheat…do it first to make sure that you understand how to calculate this foundational concept by hand.

Did you get 9.66? Well done.

A second measurement of central tendency of a dataset is the median. The median is, literally, the middle of the dataset:
- It is the central value of an array of numbers sorted in ascending (or descending) order;
- 50% of the observations lie below the median and 50% of the observations lie above it;
- It represents the second quartile (Q2);
- It is unique.

As with the mean, the median is used when the data is ratio scale (quantitative). However, unlike the mean, the median can accommodate extreme values.

FUN MANUAL CALCULATION!! Take the men in the Norcross plant (n=10) again, and determine the median years they have spent in their current job. The answer is on the next page. Did you cheat last time? You can redeem yourself by doing this one by hand…

Did you get 9.5? Well done. The mean and the median are pretty close – 9.66 and 9.50, respectively.
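Both are one-liners in software. Here is a sketch in R, assuming the WidgeOne workbook has been read into a data frame named widge, and assuming the plant and gender codings shown (both are assumptions about the file, not facts from it):

    # Verifying the hand calculations for the Norcross men.
    # Assumes WidgeOne.xls has been loaded, e.g. with the readxl package:
    #   library(readxl)
    #   widge <- read_excel("WidgeOne.xls", sheet = "Plant_Survey")
    norcross_men <- widge$YRONJOB[widge$PLANT == "Norcross" &
                                  widge$GENDER == "M"]
    mean(norcross_men)     # the hand calculation gave 9.66
    median(norcross_men)   # the hand calculation gave 9.5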
But which one is "right"? Which one should be reported as the "central tendency", the most representative value, of the years on the job for the men in the Norcross plant? Mathematically they are both correct, but which one is best?

The mean is the best measure of central tendency for quantitative variables under these circumstances:
- The distribution of the variable in question is unimodal.
- The distribution is also symmetric.

In fact, both the mean and the median require that the distribution of the variable be unimodal. Otherwise, they are both typically misleading and even incorrect. What is unimodal, you ask? When referring to the shape of the distribution (which we are), unimodal means there is only one maximum (only one hump).

The following graphic is an example of a unimodal distribution (a histogram of 100 men's heights):

[Figure: histogram of 100 men's heights; Height (in inches) from 62 to 76 on the horizontal axis, Frequency on the vertical axis; a single hump]

And here is a bimodal (two-hump) distribution (a histogram of 200 people's heights):

[Figure: histogram of 200 people's heights; Height (in inches) from 52 to 76 on the horizontal axis, Frequency on the vertical axis; two humps]

For the unimodal group, the mean and median fall right at the single hump, so either is an accurate measure of central tendency. For the bimodal group, the mean and median both come out around 63 inches – a value that is certainly misleading, because it falls between the two humps, where there are actually two locations of central tendency. This is why the mean and the median are only appropriate for unimodal distributions!

For the mean to be an appropriate measure of central tendency, the data has to be symmetric as well as unimodal. A distribution is symmetric when the first half of the distribution is a mirror image of the second half. The unimodal histogram of the men's heights is (roughly) symmetric:

[Figure: the same histogram of 100 men's heights, showing rough symmetry]

If a distribution is not symmetric, then it is referred to as skewed. Data can be right or left skewed. Here is an example of right-skewed data:

[Figure: right-skewed histogram of a generic variable]

Here is an example of left-skewed data:

[Figure: left-skewed histogram of a generic variable]

When the data is symmetric, the mean and the median should be pretty close, in which case you would use the mean as the measure of central tendency. If the median and mean are not close, there is evidence that the distribution is skewed. Consider the men in Norcross again. What if employee 082 had 30 years with the company instead of 14 years? How would the mean and median be affected? The mean would increase to 11.26 while the median would remain the same at 9.50 (do this by hand to convince yourself of this concept). Go back and look at the formula for the mean and think about why the mean was so heavily affected while the median was not. A boxplot will provide further evidence of symmetry (more on boxplots later).

Steps in Identifying the Best Measure of Central Tendency
o Ensure that the variable is indeed quantitative (i.e., measured with continuous numbers).
o Generate and inspect a histogram of the variable and identify its modality (is it unimodal?). Inspect the histogram for approximate symmetry and possible outliers.
o Generate and inspect a boxplot. Discuss further evidence of approximate symmetry and the existence of possible outliers.
o Compare and contrast the mean and median as a final piece of evidence of symmetry (or non-symmetry).

Your Final Decision
o When data are unimodal and symmetric, the mean is the best measure of central tendency.
o When data are unimodal and non-symmetric (skewed), the median is the best measure of central tendency.
o When data are not unimodal, use neither the mean nor the median; instead, present a qualitative description of the shape and modality of the distribution.

A third measurement of central tendency is the mode. The mode is the most frequently occurring value in a dataset:
- There can be multiple modes;
- It is not influenced by extreme observations;
- It can be used with both qualitative and quantitative data.

Go back to the WidgeOne.xls dataset and the men in the Norcross plant. What is the mode of their years on the job? Did you get 14 years? Great! This is a measurement of central tendency. But 14 years is different (a lot different) from 9.66 and 9.50 years. Is it correct? Technically, yes – it is mathematically correct – but it is not the most appropriate measurement to report as the "central tendency" of this dataset. Typically, the mode is considered to be the weakest of the three measurements of central tendency for quantitative data and is ONLY used when the mean and median are not available. When would that be? Calculate the mean and median gender of the dataset. Go ahead. We will wait. It can't be done. When the data in question is qualitative (e.g., gender, plant, position), the ONLY measurement of central tendency that is available is the mode.

Concept 2: Measurements of Dispersion

When describing a dataset to someone, it's generally not enough to provide just the measurement of central tendency; you should also provide some measurement of dispersion. We use measurements of dispersion to describe how spread out the data is. We can provide this information in two ways – by calculating the standard deviation of the dataset and by providing the frequency counts across different ranges of the data.

You can think of the standard deviation of a dataset as, roughly, the average distance of each observation from the mean. Here is the formula:

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$

where
$X_i$ = each individual observation;
$\bar{X}$ = the mean of the dataset;
$N$ = the number of observations in the dataset.

Note – when calculating the standard deviation of a sample rather than a population, the denominator becomes n−1. We subtract one degree of freedom.

The standard deviation tells us how far, on average, the observations sit from the mean. If this number is large, the data is very spread out (i.e., the observations are quite different from one another). If this number is small, the data is very compact (i.e., the observations are very similar).

FUN MANUAL CALCULATION!! Refer back to the WidgeOne.xls dataset. Calculate the standard deviation of the number of years on the job for the men in Norcross (n=10). Remember that the mean was 9.66 years. The answer is on the next page…don't cheat…do it first to make sure that you understand how to calculate this foundational concept by hand.

Did you get 3.30? Well done. What does this number MEAN? 3.30 what? It means that the standard deviation of the dataset is 3.30 years: the average deviation (in either direction) of each individual's tenure from the mean of 9.66 years is 3.30 years.
Relative to the mean, we would consider this data to be fairly compact…meaning that the data is not very spread out (this will be seen more clearly in the next section, when a graphical representation is created).

You may recall from your earlier statistics course(s) a second statistical calculation that provides a measurement of dispersion – the variance. The variance is simply the square of the standard deviation. Although variance is an important concept to statisticians, it is not typically used by practitioners, because variance is not very "user friendly" in terms of interpretation. In the case of the men in Norcross, the variance would be reported as "10.88 years squared".

There is another application of the term "variance", with a more generic meaning, that is heavily used by practitioners. It is the difference, either in absolute numbers or percentages, of each observation from some base value. For example, it is common for individuals to refer to a "budget variance", where this number is the actual number minus the budgeted number:

Project #   Budget Hours   Actual Hours   Variance   Variance %
123         150            175            +25        +17%

Remember, when calculating the variance percentage in this context, you take the difference (175 − 150) divided by the budgeted number (150), not the actual number (many professionals make this mistake…once).

Another method of representing the dispersion of a dataset is to provide the frequency counts for observations across specified ranges.

FUN MANUAL CALCULATION!! Using the WidgeOne.xls dataset, determine the number of individuals with job tenure (YRONJOB) in the following categories: less than 5 years, 5–10 years, and more than 10 years. Here is how your answer should appear:

Category             Frequency   Relative Frequency   Cumulative Frequency
Less than 5 years    9           22.50%               22.50%
5–10 years           16          40.00%               62.50%
More than 10 years   15          37.50%               100.00%
Total                40          100.00%

It is important to note that the categories are mutually exclusive (no observation can occur in two categories simultaneously) and collectively exhaustive (every observation is accommodated).

This representation of the dispersion of the data is referred to as a frequency table and is the most common and one of the most useful representations of data. In this instance, we converted a quantitative variable into a qualitative variable for the purposes of developing a frequency table. We do this frequently to take a different kind of look at a quantitative variable. If we had a qualitative variable that we wanted to better understand, we would generate the appropriate measurement of central tendency (the mode) and the measurement of dispersion (frequencies) through the application of a frequency table.

What you need to know
Measurements of dispersion provide information regarding how spread out or compact the data is. Typically this is communicated through the computation of the standard deviation AND some display of the frequency counts of the observations across specified categories. If the data is qualitative, the only measurement of dispersion comes from the frequency table.
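Here is a sketch of both dispersion summaries in R, again assuming the hypothetical widge data frame from earlier; the category boundaries passed to cut() are one reasonable reading of the table above, since the original does not say which bin the boundary values fall into:

    # Standard deviation of the Norcross men's tenure. Note that R's
    # sd() uses the n-1 (sample) denominator, so it may differ slightly
    # from a hand calculation that divides by N.
    sd(norcross_men)

    # Frequency table over mutually exclusive, exhaustive categories:
    tenure_cat <- cut(widge$YRONJOB, breaks = c(0, 5, 10, Inf),
                      labels = c("Less than 5", "5-10", "More than 10"))
    table(tenure_cat)                       # frequencies
    prop.table(table(tenure_cat))           # relative frequencies
    cumsum(prop.table(table(tenure_cat)))   # cumulative frequencies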
Concept 3: Visualization of Univariate Data

Typically, data analysis includes BOTH the computational analysis and some visual representation of that analysis. Many recipients of your work will never look at your actual calculations – only your tables and graphs (remember the reference above to the "great statistical unwashed"?). As a result, the visual representation of your analysis should receive the same amount of attention and dedication as your computational analysis. Edward Tufte has published several books and articles on the topic of the visualization of data. We recommend his seminal work The Visual Display of Quantitative Information as an excellent reference on the topic. See https://www.edwardtufte.com/.

When developing a visual representation of a single variable, the most common tools include histograms, pie charts, bar charts, box plots, and stem and leaf plots. Each of these will be discussed briefly in turn.

Histograms

Histograms visually communicate the shape, central tendency, and dispersion of the dataset. For this reason, histograms are heavily used in conjunction with the measurements of central tendency and dispersion to describe a particular variable (as we did while discussing central tendency). Histograms are used with QUANTITATIVE DATA. In all of the packages that we discuss below, you can simply reference the quantitative variable directly and a histogram will be generated.

The following histogram was generated using Minitab:

[Figure: histogram of WidgeOne employee job tenure; Years on Job (0–18) on the horizontal axis, Frequency on the vertical axis]

Note in this graphic that the vertical axis represents the actual frequency counts and the horizontal axis represents the job tenure of the employees. From this graphic, it is easy to see that the data is (roughly) normally distributed, with a mean, median, and mode somewhere around 9 years.

Pie Charts

Pie charts can be useful for displaying the relative frequency of observations by category, if used properly. They can be used to visualize ordinal data, but bar charts are more appropriate for showing the inherent order. Consider these two guidelines:
o Use 5 or fewer "slices" – if more than 5 slices are needed, use a table;
o Order the relative frequencies in ascending (or descending) order.

Using the same job tenure data, the associated pie chart, generated using Minitab, would look like this:

[Figure: pie chart of WidgeOne employee job tenure – 5 to 10 years: 40.0%; more than 10 years: 37.5%; less than 5 years: 22.5%]

It should probably be noted at this point that approximately 8% of all men and 0.5% of all women are colorblind. Although colorblindness comes in many different forms, the most common forms involve the colors red, green, yellow, and brown; individuals with these forms cannot distinguish among these colors. Therefore, when constructing pie charts or any other type of colored visual representation of your analysis, avoid placing these colors adjacent to each other.

Bar Charts

Bar charts ARE NOT histograms! Bar charts are intended to represent the frequency counts of QUALITATIVE data.
The plant information from WidgeOne.xls would look like this:

[Figure: bar chart of plant employees – counts for Dallas and Norcross; developed using Minitab]

Bar charts and pie charts are the primary tools used to display qualitative data, but keep in mind that, for ordinal data, bar charts are more appropriate than pie charts: a bar chart can illustrate the natural order of the data, whereas a pie chart cannot. When using bar charts as a visual of ordinal data, be sure to display the correct order of the data. Remember that, when constructing graphical displays of nominal data, most software packages will order the values alphabetically, not in their natural order. Often you will have to go in and change this (don't worry – we will show you how).

Stem and Leaf Plots

Stem and leaf plots, like histograms, provide a visual representation of the shape of the data and the central tendency of the dataset. Here is the stem and leaf plot for the job tenure variable:

     2   0 | 01
     7   0 | 22233
    12   0 | 44555
    16   0 | 6777
    (8)  0 | 88888999
    16   1 | 0000111
     9   1 | 2333
     5   1 | 4445
     1   1 | 7

When reading this stem and leaf plot, the middle digit is the "stem" (the tens digit) and the digits to its right are the "leaves" (the units digits). The left-hand column counts observations cumulatively in from the nearer end of the data, and the parenthesized (8) marks the row that contains the median. For example, the bottom row has stem 1 and leaf 7, meaning there is one observation with 17 years on the job (17.0, in fact); the 1 in the left column confirms that only one employee falls in that row.

Boxplots

The last tool described in this manual for visualizing univariate data is the boxplot. The boxplot builds on the information displayed in a stem-and-leaf plot, focuses particular attention on the symmetry of the distribution, and incorporates numerical measures of central tendency and location. Prior to creating a boxplot, you need to be familiar with the concept of quartiles. The boxplot incorporates the median, the mean, and the quartiles of a variable. The quartiles of a dataset are the points below which 25%, 50% (the same as the median), 75%, and 100% (the maximum value) of the data lie; they are typically written as Q1, Q2, Q3, and Q4, respectively. The distance between Q1 and Q3 is referred to as the interquartile range, or IQR; the center 50% of the dataset lies within it.

Below is the boxplot for the job tenure variable from WidgeOne.xls:

[Figure: boxplot of job tenure; Years on Job (0–18) on the vertical axis, with the median line and IQR box labeled]

From this boxplot, you can see that Q1 begins at 5, Q2 (also the median) begins at 8 (the actual median of the dataset is 8.35), Q3 begins at 11, and the highest value of the dataset is 17.0. Notice that the distance from the median line to the top of the IQR box is roughly the same as the distance from the median line to the bottom of the IQR box. From this, we would conclude that this dataset is relatively symmetric. As previously mentioned while discussing central tendency, box plots are an excellent tool for examining the symmetry of the data and identifying potential outliers.
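Minitab produced the figures above; for the curious, here is a sketch of rough R equivalents, assuming the hypothetical widge data frame (and the tenure_cat categories) built earlier:

    # One univariate display per line; all take a quantitative variable
    # except barplot(), which takes counts of a qualitative one.
    hist(widge$YRONJOB, xlab = "Years on Job",
         main = "Histogram of WidgeOne Employee Job Tenure")
    pie(table(tenure_cat))                  # tenure_cat from the Concept 2 sketch
    barplot(table(widge$PLANT))             # bar chart of a QUALITATIVE variable
    stem(widge$YRONJOB)                     # stem-and-leaf plot
    boxplot(widge$YRONJOB, ylab = "Years on Job")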
The following graphic is a boxplot of data with a right-skewed distribution:

[Figure: boxplot of a right-skewed generic variable]

You can tell that the distribution is right skewed because the distances inside the box from the median line are not equal and the upper vertical line (whisker) is longer than the lower one.

The following graphic is a boxplot of a left-skewed distribution:

[Figure: boxplot of a left-skewed generic variable]

The opposite is true for the boxplot above: we can see that the distribution of the generic variable is left skewed.

What you need to know
Many individuals who are analytically very strong often place insufficient emphasis on graphics and visual representations of data. Many individuals who are not strong analytically, but need analysis to support their decision-making, often place an overemphasis on graphics and visualization. Individuals who can execute both well will go far. Histograms, stem and leaf plots, and boxplots are used with QUANTITATIVE DATA. Bar charts, pie charts, and column charts are used with QUALITATIVE DATA.

Concept 4: Organization/Visualization of Multivariate Data

Frequently, we need to understand and report the relationships between and among variables within a dataset. When developing visual representations of multiple variables, the most common tools include contingency tables (qualitative and quantitative data), stacked bar charts (qualitative data), 100% stacked bar charts (qualitative data), and scatter plots (quantitative data). Each of these will be discussed briefly in order.

Contingency Tables

One of the most common and useful methods of displaying the relationships between two or more variables is the contingency table. This table is highly versatile and easily constructed. As an example, let's take the GENDER and PLANT variables from the WidgeOne.xls dataset. A contingency table of these two variables would look like this:

Counts of Employees by Gender and Plant

Gender   Dallas   Norcross   Total
Female   13       7          20
Male     10       10         20
Total    23       17         40

This table displays the number of females and males at each plant.

We could also display this table as percentages rather than as frequencies. In the following contingency table, the percentages are given as a percentage of each gender (row percentages). Specifically, the interpretation of the first cell would be "…of all of the female employees, 65% work in Dallas".
WidgeOne Employees by Gender and Plant (row percentages)

Gender        Dallas    Norcross   Total
Female        65.00%    35.00%     100.00%
Male          50.00%    50.00%     100.00%
Grand Total   57.50%    42.50%     100.00%

The percentages could easily be reversed to represent the percentage of individuals at each plant (column percentages):

WidgeOne Employees by Gender and Plant (column percentages)

Gender   Dallas    Norcross   Total
Female   56.52%    41.18%     50.00%
Male     43.48%    58.82%     50.00%
Total    100.00%   100.00%    100.00%

In this version of the table, the first cell now communicates "…of all of the Dallas employees, 56.52% are female."

Finally, we can also represent the data as overall percentages:

WidgeOne Employees by Gender and Plant (overall percentages)

Gender   Dallas   Norcross   Total
Female   32.50%   17.50%     50.00%
Male     25.00%   25.00%     50.00%
Total    57.50%   42.50%     100.00%

In this version of the table, the first cell now communicates "…of all employees, 32.50% are females in Dallas". Before moving on, please ensure that you fully understand the differences across these three tables. They are subtle, but important.

Both gender and plant are categorical variables. We could also incorporate a quantitative variable into this table – such as job tenure:

Mean Job Tenure of Employees by Gender and Plant

Gender        Dallas   Norcross   Total
Female        8.85     6.94       8.19
Male          7.13     9.66       8.40
Grand Total   8.10     8.54       8.29

This table now provides information about the average job tenure for each gender, for each plant, and for each gender at each plant. For example, the first cell now communicates, "…the females in Dallas have an average job tenure of 8.85 years". These contingency tables were created using MS Excel.

Stacked Bar Charts

Stacked bar charts are a convenient way to display percentages or proportions, such as might be done in a pie chart, for multiple variables. For example, the proportion of each gender at each plant would be displayed like this in a stacked bar chart:

[Figure: stacked bar chart of gender (Female/Male) by plant (Dallas/Norcross), with counts on the vertical axis]

This graphic is fine. However, when the group sizes differ – particularly by a lot – stacked bar charts are less informative, because it is difficult to understand how the groups compare. For example, the difference in the number of Dallas and Norcross employees is not dramatic, but even here it is difficult to discern which plant has the greater proportion of men.

100% Stacked Bar Charts

To solve this problem, we can apply a 100% stacked bar chart. This visualization tool simply calibrates the populations of interest – like the two plants – so that both are evaluated out of a total of 100%. You can almost think of 100% stacked bar charts as side-by-side pie charts.

[Figure: 100% stacked bar chart of gender by plant; percent within levels of plant on the vertical axis]

Compare this graphic to the first stacked bar graph. They are different, and they communicate subtly different messages.
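Before moving on to scatter plots, here is a sketch of how these tables and charts might be produced in R, using the same assumed widge data frame:

    tab <- table(widge$GENDER, widge$PLANT)
    tab                              # counts
    prop.table(tab, margin = 1)      # row percentages (within each gender)
    prop.table(tab, margin = 2)      # column percentages (within each plant)
    prop.table(tab)                  # overall percentages
    tapply(widge$YRONJOB,
           list(widge$GENDER, widge$PLANT), mean)  # mean tenure pivot table

    barplot(tab)                     # stacked bar chart (counts)
    barplot(prop.table(tab, 2))      # 100% stacked version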
Scatter Plots

What if we wanted to better understand whether there is a meaningful relationship between two quantitative variables – such as the possible relationship between job tenure and productivity? This question can be addressed using a scatter plot, where one quantitative variable is plotted on the y-axis and the second quantitative variable is plotted on the x-axis:

[Figure: scatter plot "Is Job Tenure Related to Productivity?" – Productivity (70–100) on the y-axis, Job Tenure (0–20) on the x-axis]

If two variables are related, we would expect to see some pattern within the scatter plot, such as a line. If job tenure and productivity were "positively" related, we would expect to see an upward-sloping pattern running from the SW corner to the NE corner, indicating that as job tenure goes up, productivity goes up. If job tenure and productivity were "negatively" related, we would expect to see a downward-sloping pattern running from the NW corner to the SE corner, indicating that as job tenure goes up, productivity goes down. In this scatter plot, neither of these linear patterns (or any other pattern) is reflected. Such a "cloud" is referred to as a "null plot". As a result, we would conclude that job tenure and productivity are not related.

We can derive additional information from this scatter plot. Specifically, we can determine the "best fit" line – in the form y = mx + b. This is the linear equation that minimizes the distances between the predicted values and the actual values, where y = the predicted value of an employee's productivity and x = the actual number of years of an employee's job tenure: y = −0.5715x + 89.318. This equation generates an "R²" value of 0.1124, which represents the percentage of the variance of the dependent variable (productivity) that can be explained by the independent variable (job tenure). Detailed explanations of these concepts are outside the scope of this document, but they are heavily used in statistics and form the basis of regression modeling. For a more detailed explanation of regression modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.

What you need to know
Stacked bar charts are used to display the counts within groupings of qualitative variables. When those groupings are of different sizes, a 100% stacked bar chart is preferred. You can think of 100% stacked bar charts as side-by-side pie charts. Scatter plots are used to communicate whether a relationship exists between two quantitative variables.
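A sketch of the scatter plot and best-fit line in R, assuming widge holds the YRONJOB and PRDCTY columns:

    plot(widge$YRONJOB, widge$PRDCTY,
         xlab = "Job Tenure", ylab = "Productivity",
         main = "Is Job Tenure Related to Productivity?")
    fit <- lm(PRDCTY ~ YRONJOB, data = widge)  # least-squares line, y = mx + b
    abline(fit)                 # overlay the fitted line on the plot
    coef(fit)                   # intercept b and slope m
    summary(fit)$r.squared      # the R-squared value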
The following side-by-side histogram shows job tenure by plant for the Widge One employees: Histogram of Widge One Employee Job Tenure by Plant 0 Dallas 4 8 12 16 Norcross 5 Frequency 4 3 2 1 0 0 4 8 12 16 Years on Job Panel variable: Plant Now at a glance, we can see that both plants have roughly the same distribution, but the Dallas plant seems to have more of the less experienced employees than the Norcross plant. 45 Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University A side-by-side box plot also has the same requirements – the box plots should be built by quantitative variable and grouped by a qualitative variable. Let’s use the same two variables again, YRONJOB and Plant: Boxplots of Widge One Employee Job Tenure by Plant 18 Dallas Norcross 16 Years on Job 14 12 10 8 6 4 2 0 Panel variable: Plant Nice! Now we can see that the Dallas plant employees have a larger range of job tenures, and that the median job tenure at the Norcross plant is larger than the median job tenure at the Dallas plant. Both the side-by-side histogram and boxplot were generated using Minitab. 46 Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University Concept 5: Random Number Generation and Simple Random Sampling The statistical concepts covered up to this point would really fall under the heading of “Data Analysis” or “Basic Descriptive Statistics”. These concepts enable us to describe or represent a given dataset to other people and are employed once the data have been gathered. They represent a critical, albeit simple, set of analytical tools. Now let’s take a step back…what if the data NEEDS to be gathered? Entire disciplines exist in the areas of experimental design and sampling. Although the scope of this document does not include an examination of these areas, we will address a foundational concept of these areas – random number generation to support simple random sampling using statistical software. Humans are woefully deficient in our ability to generate truly random numbers. In fact, human “random” number generation is so NOT random, that computer programs have been written that accurately predict the “random” numbers that humans will select. Randomly generated numbers can be forced to follow a particular probability distribution and/or fall between an established minimum and maximum value. We will be generating numbers which follow a uniform distribution, where every number as has the same probability of occurrence. This is the most common execution of random number generation. It should be noted that random numbers could follow any probability distribution (e.g., normal, binomial, Poisson, etc). One of the primary rationales for generating a string of random numbers is to select a sample of observations for analysis. Often, researchers do not have the time the access, or the money to analyze every element in a dataset. Assigning a random number to every element in a dataset and then selecting, for example, the first 50 elements when sorted based upon the random number, is a statistically valid method of sampling. When a uniform distribution is used to generate these random numbers, this process is referred to as simple random sampling – where every element as a 1/n probability of selection. Simple random sampling using random number generation is a very common execution used by analysts to select a subset of a population of elements for analysis. 
Concept 6: Confidence Intervals

As stated previously, Concepts 1–4 fall under the heading of "Descriptive Statistics", where the analyst has access to the entire dataset and is simply providing a "description" or visual representation of the central tendency or the dispersion of the dataset. Concept 5 – random number generation – is an important tool that analysts use to subset a dataset or assign elements for survey or additional analysis. When a sample is analyzed for the purposes of better understanding a population, the process is referred to as "Inferential Statistics". (Inferential statistics is based on the Central Limit Theorem, and readers are assumed to have a working knowledge of this theorem. For a refresher, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.) Here is a brief comparison of descriptive and inferential statistics:

              Descriptive Statistics              Inferential Statistics
Dataset       Population (the entire dataset)     Sample from a population
Accuracy      100% accurate (assuming the         Some margin of error will be
              calculations were done correctly)   expected
Confidence    100%                                Typically 90%, 95%, or 99%
Example       Measurements of central tendency    Confidence intervals around a
                                                  population parameter
Preference?   ALWAYS preferred!                   Never preferred…but accepted as a
                                                  trade-off for cost and/or time

Concept 6 – confidence intervals – is therefore different from the first four concepts reviewed in this manual, because we are moving from descriptive statistics to inferential statistics. Simply stated, a confidence interval is an estimate of some unknown population parameter (usually the mean), based on sample statistics, where the acceptable margin of error and/or confidence level is pre-established.

The formula used to estimate a two-sided confidence interval for a population mean is

$\bar{X} \pm Z \cdot \frac{s_X}{\sqrt{n}}$

where
$\bar{X}$ = the sample mean;
$Z$ = the number of standard deviations, using the sampling distribution and the Central Limit Theorem, associated with the established confidence level:
  90% confidence: Z = 1.645
  95% confidence: Z = 1.96
  99% confidence: Z = 2.575
$s_X$ = the sample standard deviation;
$n$ = the number of elements in the sample.

The formula used to estimate a two-sided confidence interval for a population proportion is

$\hat{p} \pm Z \cdot \sqrt{\frac{\hat{p}\hat{q}}{n}}$

where
$\hat{p}$ = the sample proportion;
$\hat{q} = 1 - \hat{p}$;
$Z$ and $n$ = same as above.

In both formulas, the expression after the ± sign is referred to as the "margin of error".

FUN MANUAL CALCULATION!! Let's assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of employees in Norcross, GA and Dallas, TX. Let's also assume that the HR department at WidgeOne has been charged with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire organization, so they surveyed the 40 employees in our dataset. Report the job satisfaction for all WidgeOne employees, using the sample of 40. Use a 95% level of confidence.
From the WidgeOne.xls dataset, the mean job satisfaction is 6.85 (where 1 = low satisfaction and 10 = high satisfaction) and the standard deviation is 1.02. Using the formula above, the confidence interval calculation is:

6.85 ± 1.96 × (1.02 / √40), or 6.85 ± 0.32

If you actually gave this number to most people, they would have no idea what it meant. The proper way to communicate this information is: "Based on a representative sample of 40 employees, we are 95% confident that mean job satisfaction among all employees is between 6.53 and 7.17." Loosely speaking, this means that the probability that the "true" mean job satisfaction of all employees, which is unknown, falls between 6.53 and 7.17 is 95%, and that there is a 5% probability that the true mean job satisfaction is outside of this range (< 6.53 or > 7.17).

What you really need to know
When calculating confidence intervals, use 95% as a default unless you know something about the decision maker. If the decision maker is conservative, use a 99% interval. If the decision maker is risk tolerant, use a 90% interval. To increase confidence and decrease the margin of error, increase the sample size.

Explanatory and Response Variables

The main objective of multivariate analysis is to assess the relationship between two or more variables. A common type of relationship that we examine in statistics is the cause-effect relationship, in which the variables play two different roles – the explanatory role and the response role. The response variable is the outcome of interest that is being researched. The explanatory variable is hypothesized to explain or influence the response variable. For example, research studies investigating lung cancer often specify survival status (whether an individual is alive after 20 years) as the response variable and smoking status (whether an individual used smoking tobacco and, if so, in what amount) as the explanatory variable.

There are specific locations that are traditionally designated for the explanatory and response variables in the analysis methods we've discussed. The following table summarizes the proper locations of these variables for each of these analyses:

Method of Analysis                Location of Explanatory Variable   Location of Response Variable
Stratified Analysis               1 or more columns                  Rows
Stratified Confidence Intervals   1 or more columns                  Rows
Contingency Table                 Rows                               Columns
Grouped Histogram                 Different panels                   X-axis
Stacked Bar Charts                Bars                               Stacks
Side-by-Side Boxplots             X-axis                             Y-axis
Scatterplot                       X-axis                             Y-axis
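One last lagniappe: an interval like the one in Concept 6 is easy to check in software. Here is a sketch in R, recomputing the interval from the summary statistics reported above (mean 6.85, standard deviation 1.02, n = 40):

    xbar <- 6.85; s <- 1.02; n <- 40
    z  <- qnorm(0.975)        # 1.96 for a two-sided 95% interval
    me <- z * s / sqrt(n)     # margin of error, about 0.32
    c(xbar - me, xbar + me)   # roughly (6.53, 7.17)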