Basic Concepts Reference Manual:
A gentle overview
Table of Contents
1. Introduction
   Statistical Packages
   The WidgeOne Dataset
2. Data Analysis and Statistical Concepts
   Concept 1 – Measurements of Central Tendency
   Concept 2 – Measurements of Dispersion
   Concept 3 – Visualization of Univariate Data
   Concept 4 – Visualization of Multivariate Data
   Concept 5 – Random Number Generation and Simple Sampling
   Concept 6 – Confidence Intervals
Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University
These reference manuals have been developed to assist students in the basics of statistical computing – sort of a
“Statistical Computing for Dummies”. It is not our intention to use this manual to teach statistical concepts¹…but rather
to demonstrate how to utilize previously taught statistical and data analysis concepts the way that professionals and
practitioners apply them – through the able assistance of computing. Proficiency in software allows students to focus
more on the interpretation of the output and on the application of results rather than on the mathematical computations.
We should pause here and strongly make the point that computers should serve as a medium of expediency of calculation
– not as a substitution for the ability to execute a calculation.
In the Basic Concepts manual, we present statistical concepts, context for their use, and formulas where appropriate. We
provide exercises to execute these concepts by hand. Then, in each subsequent manual, the concepts are applied in a
consistent manner using each of the five major statistical computing packages – Excel, SPSS, Minitab, R and SAS.
¹ Readers of this manual are assumed to have completed an introductory statistics course. For individuals wishing to review statistical concepts, we recommend Intro Stats by De Veaux, Velleman, and Bock.
Statistical Packages Used in this Manual
We have chosen to incorporate the five most widely used statistical computing packages in these manuals – Excel, SPSS,
Minitab, SAS, and R. While each of these packages can be used for basic data analysis, they each have specializations.
Any individual who can represent themselves as knowledgeable and proficient in any subset or all of these packages will
possess a marketable and differentiating skill set.
Excel This spreadsheet package is ubiquitous. It represents a very basic and efficient way to organize, analyze, and
present data. Employers today expect that, at a minimum, new hires with college degrees will have a working knowledge
of Excel. Excel is used anywhere that data is available – which is everywhere. Excel is found in offices, libraries, schools,
universities, home offices, and everywhere in between.
In addition to its role as a data analysis package, Excel is often used as a starting point to capture and organize data and
then import it into more sophisticated analysis packages such as SPSS, Minitab or SAS. And, after analysis is complete,
datasets can be exported back to Excel and shared with others who may not have access to (or have the ability to use)
other analysis packages (we gently refer to this group as the “great statistical unwashed”).
For product information regarding Excel, please visit: http://office.microsoft.com/en-us
SPSS The “Statistical Package for the Social Sciences” or SPSS is one of the most heavily used statistical computing
packages in industry. SPSS has over 250,000 customers in 60 countries and is particularly heavily used in Medicine,
Psychology, Marketing, Political Science and other social sciences. Because of its more “point and click” orientation, SPSS
has become one of the preferred packages of non-statisticians.
For product information regarding SPSS, please visit: http://www.spss.com/
Minitab Minitab was developed by Statistics professors at Penn State University (where it is still headquartered) in 1972.
These professors were looking for a better way to teach undergraduate statistics in the classroom. From this starting
point, Minitab is now used in over 4,000 universities around the world, in 80 countries and by hundreds of companies
ranging from Fortune 500 to startup companies. Of the main statistical computing packages, Minitab has the strongest
graphics and visualization capabilities. The package is most heavily used in Six Sigma and other quality design
initiatives. Minitab’s customer list includes a large number of manufacturing and product design firms such as Ford, GE,
GM, HP and Whirlpool.
For product information regarding Minitab, please visit: http://www.minitab.com/
SAS “Statistical Analysis System” or “SAS” is typically considered to be the most complete statistical analysis package
on the market (Professional Tip – please pronounce this as “sass”; if you pronounce the package as “S-A-S”, people will
think you are a poser). This is the package of choice of most applied statisticians. Although the most recent version of
SAS (version 9) includes some point-and-click options, SAS uses a scripting language to tell the computer what data
manipulations and computations to perform. We will demonstrate how to actually write the code for SAS rather
than defaulting to the point-and-click functionality in v.9, SAS Enterprise Guide, SAS Enterprise Miner, and other more
user-friendly GUI SAS products. Our rationale here is this – if you learn to drive a manual transmission, you can drive
anything. Similarly, if you can program in Base SAS, you can use (and understand) just about any statistical analysis
package. The learning curve for SAS is longer and steeper than for the other packages, but the package is considered the
benchmark for statistical computing. SAS is used in 110 countries, at 2,200 Universities, and at 96 of the Fortune 100
companies.
For product information regarding SAS, please visit: http://www.sas.com/
R R is a command-driven programming environment for executing statistical analysis. Unlike all of the other software
packages we have discussed, which are proprietary, R is an open-source program that is free and readily available for
download from the internet. R is becoming quite popular for quantitative analysis in many fields including statistics, social
science research (Psychology, Sociology, Education, etc.), marketing research, business intelligence, etc. R is an
implementation of the S programming language, originally developed at Bell Labs in the 1970s (S-Plus is a commercial
implementation of the same language).
For product information regarding R, please visit: http://cran.r-project.org/
Organization of the Manuals
After a brief review of the most common, and we believe essential, statistical/data analysis concepts that every college-educated person, regardless of discipline, should know, we will then explain how each of these concepts is executed in
Excel (2010), SPSS (v.18), Minitab (v.16), SAS (v. 9.2), and R.
We have taken a software-oriented approach rather than a statistical concept-oriented approach, because it is the software
application rather than the statistical concepts that represents the focus of this document. For example, our first concept is
descriptive statistics. Rather than explaining descriptive statistics through each package and then moving into the second
analysis concept, we focus on all of the concepts in Excel, and then move to a focus on all of the concepts in SPSS, etc. Yes,
we understand that from the reader’s perspective this may be a bit monotonous. After you finish your Ph.D. in Statistics,
you can write your manual your way.
Throughout each manual, we have used screenshots from the various packages, and have developed easy-to-follow
examples using a common dataset.
At the end of each manual, we have included a section titled “Lagniappe”. This word derives from New World Spanish
la ñapa, “the gift”. The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still
used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a
purchase.
Our lagniappe for our readers includes the extra and interesting things that we have learned to do with each of these
software programs that might not be easily found or well known. A little extra information at no extra cost!
Overview of Dataset
Throughout these manuals, we will use a common dataset taken from a small manufacturing company – the WidgeOne
company.
The WidgeOne dataset:
- An Excel file – WidgeOne.xls
- Both qualitative and quantitative variables – 23 variables total
- Three sheets in one workbook
  o Plant_Survey
  o Employees
  o Attendance
- 40 observations
VARIABLE     MEANING                               VARIABLE TYPE   SHEET
EMPID        Employee ID                           Qualitative     ALL
PLANT        Plant ID                              Qualitative     Plant_Survey
GENDER       Gender                                Qualitative     Plant_Survey
POSITION     Job Type                              Qualitative     Plant_Survey
JOBSAT       Job Satisfaction (1-10)               Quantitative    Plant_Survey
YRONJOB      Years in current job                  Quantitative    Plant_Survey
JOBGRADE     Job Level (1-10)                      Quantitative    Plant_Survey
SOCREL       HR Social Relationship Score (0-10)   Quantitative    Plant_Survey
PRDCTY       HR Productivity Rating (out of 100)   Quantitative    Plant_Survey
Last Name    Employee Last Name                    Qualitative     Employees
First Name   Employee First Name                   Qualitative     Employees
JAN…         Attendance in January (%)             Quantitative    Attendance
Here is a screenshot of WidgeOne.xls:

[Screenshot of the WidgeOne.xls workbook]
Data Analysis and Statistical Concepts
As former practitioners who used statistics on an almost daily basis in our professions in finance, marketing, engineering,
manufacturing and medicine, we have developed our “TOP 6” list of the most common and most useful applications of
Statistics and Data Analysis.
After a brief explanation of each concept, examples will be provided for how to execute these concepts by hand (with a
calculator). We cannot emphasize strongly enough that the calculation of the concepts needs to be mastered and fully
understood before they can be effectively “outsourced” to a software application.
Types of variables
There are two distinct types of variables: quantitative and qualitative. Quantitative variables measure how much of
something (the quantity) a unit possesses. For example, in the WidgeOne dataset, the quantitative variable
YRONJOB measures how many years each employee has spent in their current job. Quantitative variables are also known
as continuous variables.
Qualitative variables identify whether an observation belongs to a group. In the WidgeOne dataset, GENDER is a qualitative
variable – it records whether each employee is classified as male or female. Qualitative variables can
certainly have number values – such as 0 for male and 1 for female – but these numbers are still just group labels and
absolutely cannot be treated as quantitative values. If an employee has a 1, it indicates that the employee is female – it
does not mean that the employee has more gender than someone with a 0. Qualitative variables are also known as
categorical variables.
There are two types of qualitative variables: nominal and ordinal. As the name implies, the values of nominal variables
carry information about the name of the group they belong to – such as gender and plant. A special case of nominal
variables is identifier variables. They (you guessed it) serve as a way to identify each observation and carry no other
useful information. For purposes of analysis, these are treated as neither quantitative nor qualitative.
Ordinal variables, also as the name implies, have a natural inherent order and measure how much of something a
subject possesses. Ordinal values look like “a little”, “some”, “a lot” or “small”, “medium”, “large”. Things start
to get a little fuzzy here. An ordinal variable can sometimes be treated as quantitative (measuring the quantity) only if we
know “how much more” each category is than the one preceding it.
Concept 1: Measurements of Central Tendency
The most common application of Statistics is the measurement of the central tendency of a dataset, of which there are three
kinds. “Central tendency” is a geeky way of answering the question – “What is the most representative value?” The mean,
median, and mode are all measures of central tendency, all measures of the average. If you are reporting or discussing a
value as a mean, label it as such. Do not use the words “mean” and “average” interchangeably.
The mean is the first and most popular measurement of central tendency because:
- It is familiar to most people;
- It reflects the inclusion of every item in the dataset;
- It always exists;
- It is unique;
- It is easily used with other statistical measurements.
The formula for the calculation of a mean is:

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $X_i$ = every observation in the dataset and $N$ = the number of observations in the dataset.
We know how everyone LOVES formulas with Greek letters!
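For readers who prefer code to Greek letters, the formula translates directly. Here is a quick sketch in Python (not one of the five packages this manual covers), using hypothetical tenure values rather than the actual WidgeOne data:

```python
# Mean: sum every observation X_i, then divide by the number of
# observations N. The tenure values below are hypothetical -- they
# are NOT the actual WidgeOne Norcross data.

def mean(values):
    return sum(values) / len(values)

tenure_years = [7.0, 8.5, 9.0, 10.0, 14.0]  # hypothetical YRONJOB values
print(mean(tenure_years))  # 9.7
```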
FUN MANUAL CALCULATION!!
Using the WidgeOne.xls dataset, calculate the mean years that men in the Norcross plant (n=10) have been in their current
job (YRONJOB). The answer is on the next page…don’t cheat…do it first to make sure that you understand how to
calculate this foundational concept by hand.
Did you get 9.66? Well done.
A second measurement of central tendency of a dataset is the median. The median is literally the middle of the dataset:
- It is the central value of an array of numbers sorted in ascending (or descending) order;
- 50% of the observations lie below the median and 50% of the observations lie above the median;
- It represents the second quartile (Q2);
- It is unique.
As with the mean, the median is used when the data is ratio scale (quantitative). However, unlike the mean, the median
can accommodate extreme values.
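In code, finding the median is a sort followed by picking (or averaging) the middle value. A Python sketch with made-up numbers:

```python
# Median: sort the values and take the middle one; with an even number
# of observations, average the two middle values.

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # odd n: single middle value
    return (s[mid - 1] + s[mid]) / 2   # even n: average the two middle values

print(median([3, 9, 7]))      # 7
print(median([3, 9, 7, 5]))   # 6.0 (average of 5 and 7)
```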
FUN MANUAL CALCULATION!!
Take the men in the Norcross plant (n=10) again, and determine the median years they have spent in their current job.
The answer is on the next page. Did you cheat last time? You can redeem yourself by doing this one by hand…
Did you get 9.5? Well done.
The mean and the median are pretty close – 9.66 and 9.50, respectively. But which one is “right”? Which one should be
reported as the “central tendency” or the most representative value of the years on the job for the men in the Norcross
plant? Mathematically they are both correct, but which one is best?
The mean is the best measure of central tendency for quantitative variables under these circumstances:
- The distribution of the variable in question is unimodal.
- The distribution is also symmetric.
In fact, both the mean and the median require that the distribution of the variable be unimodal. Otherwise, they are both
typically misleading and even incorrect.
What is unimodal you ask? When referring to the shape of the distribution (which we are) unimodal means there is only
one maximum (only one hump).
The following graphic is an example of a unimodal distribution (a histogram of 100 men’s heights):

[Histogram: Frequency (0-20) vs. Height in inches (62-76), one central peak]
And here is a bimodal (two-hump) distribution (a histogram of 200 people’s heights):

[Histogram: Frequency (0-40) vs. Height in inches (52-76), two peaks]
The mean and median height for both of these groups is around 63 inches. You can see that this is an accurate measure of
central tendency for the population in the first graphic, but it is certainly misleading for the population in the second
graphic where there are actually two locations of central tendency. This is why the mean and the median are only
appropriate for unimodal distributions!
For the mean to be an appropriate measure of central tendency, the data has to be symmetric as well as unimodal. The
data has a symmetric distribution when the first half of the distribution is a mirror image of the second half. The
unimodal histogram of the men’s heights is (roughly) symmetric:

[Histogram: Frequency (0-20) vs. Height in inches (62-76), approximately symmetric]
If a distribution is not symmetric, then it is referred to as skewed. Data can be right- or left-skewed.
Here is an example of right-skewed data:

[Histogram: Frequency (0-16) vs. Generic Variable (37.5-82.5), long right tail]
Here is an example of left-skewed data:

[Histogram: Frequency (0-16) vs. Generic Variable (37.5-82.5), long left tail]
When the data is symmetric, the mean and the median should be pretty close, in which case you would use the mean as
the measure of central tendency. If the median and mean are not close, there is evidence that the distribution is skewed.
Consider the men in Norcross again. What if employee 082 had 30 years with the company instead of 14 years? How
would the mean and median be affected? The mean would increase to 11.26 while the median remains the same at 9.50
(do this by hand to convince yourself of this concept). Go back and look at the formula for the mean and think about why
the mean was so heavily affected, while the median was not. A boxplot will provide further evidence of symmetry (more
on them later).
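The effect described above is easy to demonstrate in code. This Python sketch uses hypothetical tenure values (not the actual Norcross data): replacing one observation with an extreme value pulls the mean upward while the median stays put.

```python
# One extreme value moves the mean but leaves the median alone.
import statistics

tenure = [5, 7, 8, 9, 10, 10, 11, 12, 13, 14]   # hypothetical years on job
print(statistics.mean(tenure), statistics.median(tenure))

tenure_outlier = tenure[:-1] + [30]             # swap the 14 for a 30
print(statistics.mean(tenure_outlier))          # mean jumps noticeably
print(statistics.median(tenure_outlier))        # median is unchanged
```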
Steps in Identifying the Best Measure of Central Tendency
o Ensure that the variable is indeed quantitative (i.e., can be measured with continuous numbers).
o Generate and inspect a histogram of the variable and identify its modality (is it unimodal?). Inspect the histogram
for approximate symmetry and possible outliers.
o Generate and inspect a boxplot. Discuss further evidence of approximate symmetry and the existence of possible
outliers.
o Compare and contrast the mean and median as a final piece of evidence of symmetry (or non-symmetry).
Your Final Decision
o When data are unimodal and symmetric, the mean is the best measure of central tendency.
o When data are unimodal and non-symmetric (skewed), the median is the best measure of central tendency.
o When data are non-unimodal, one should use neither the mean nor the median, but instead present a qualitative
description of the shape and modality of the distribution.
A third measurement of central tendency is the mode. The mode is the most frequently occurring value in a dataset:
- There can be multiple modes;
- It is not influenced by extreme observations;
- It can be used with both qualitative and quantitative data.
Go back to the WidgeOne.xls dataset and the men in the Norcross plant. What is the mode for their years on the job? Did
you get 14 years? Great! This is a measurement of central tendency. But 14 years is different (a lot different) from 9.66 and
9.50 years. Is it correct?
Technically yes, this would be mathematically correct, but not the most appropriate measurement to report as the ‘central
tendency’ of the dataset. Typically, the mode is considered to be the weakest of the three measurements of central
tendency for quantitative data and is ONLY used if the mean or median is not available. When would that be?
Calculate the mean and median gender of the dataset. Go ahead. We will wait.
It can’t be done. When the data in question is qualitative (e.g., gender, plant, position) the ONLY measurement of central
tendency that is available is the mode.
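Code makes this point nicely: the mode is the only one of the three measures that a qualitative variable supports. A Python sketch with hypothetical values:

```python
# The mode works for qualitative data, where a mean or median cannot
# even be computed. multimode returns every most-frequent value,
# since a dataset can have more than one mode.
import statistics

gender = ["M", "F", "M", "M", "F"]   # hypothetical GENDER values
print(statistics.multimode(gender))  # ['M']

tenure = [8, 9, 14, 14, 11]          # hypothetical YRONJOB values
print(statistics.mode(tenure))       # 14
```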
Concept 2: Measurements of Dispersion
When describing a dataset to someone, it’s generally not enough to just provide the measurement of central tendency.
You should also provide some measurement of dispersion.
We use measurements of dispersion to describe how spread out the data is. We can provide this information in two ways
– calculating the standard deviation of the dataset and providing the frequency counts across different ranges of the data.
You can think of the standard deviation of a dataset as the average distance of each observation from the mean.
Here is the formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

where $X_i$ = each individual observation, $\bar{X}$ = the mean of the dataset, and $N$ = the number of observations in the dataset.

Note – if calculating the standard deviation of a sample rather than a population, the denominator becomes $n - 1$. We
subtract one degree of freedom.
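The population and sample versions differ only in the denominator, which a short Python sketch (hypothetical data) makes explicit:

```python
# Standard deviation: the square root of the average squared deviation
# from the mean. Use denominator N for a population, n-1 for a sample.
import math

def stdev(values, sample=False):
    n = len(values)
    xbar = sum(values) / n
    ss = sum((x - xbar) ** 2 for x in values)  # sum of squared deviations
    denom = n - 1 if sample else n             # n-1 drops one degree of freedom
    return math.sqrt(ss / denom)

data = [4, 6, 8, 10, 12]
print(stdev(data))                # population version
print(stdev(data, sample=True))   # sample version (slightly larger)
```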
The standard deviation gives us the average distance, in the variable’s own units, of each observation from the mean. If
this number is large, the data is very spread out (i.e., the observations are quite different from one another). If this number
is small, the data is very compact (i.e., the observations are very similar).
FUN MANUAL CALCULATION!!
Refer back to the WidgeOne.xls dataset. Calculate the standard deviation of the number of years on the job for the men in
Norcross (n=10). Remember that the mean was 9.66 years.
The answer is on the next page…don’t cheat…do it first to make sure that you understand how to calculate this
foundational concept by hand.
Did you get 3.30? Well done.
What does this number MEAN? 3.30 what? It means that the standard deviation of the dataset is 3.30 years. The average
deviation (in either direction) of each individual’s tenure is 3.30 years from the mean of 9.66. Relative to the mean, we
would consider this data to be fairly compact…meaning that the data is not very spread out (this will be seen more clearly
in the next section when a graphical representation is created).
You may recall from your earlier Statistics course(s) a second calculation that provides another measurement of
dispersion – the variance. The variance is simply the square of the standard deviation. Although variance is an
important concept to statisticians, it is not typically used by practitioners. This is because variance is not very “user
friendly” in terms of interpretation. In the case of the men in Norcross, the variance would be reported as “10.88 years
squared”.
There is another application of the term “variance” that has a more generic meaning that is heavily used by practitioners.
It is the difference, either in absolute numbers or percentages, of each observation from some base value.
For example, it is common for individuals to refer to a “budget variance”, where this number would be the actual number
minus the budgeted number:
Project #   Budget Hours   Actual Hours   Variance   Variance %
123         150            175            +25        +17%
Remember, when calculating the variance percentage in this context, you take the difference (175 - 150) divided by the
budgeted number (150), not the actual number (many professionals make this mistake…once).
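The budget-variance arithmetic from the example above, as a tiny Python sketch:

```python
# Budget variance: actual minus budget; the percentage is taken
# against the BUDGETED number, not the actual.
budget_hours = 150
actual_hours = 175

variance = actual_hours - budget_hours        # +25
variance_pct = variance / budget_hours * 100  # divide by budget (150)
print(variance, round(variance_pct))          # 25 17
```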
Another method of representing the dispersion of a dataset is to provide the frequency counts for observations across
specified ranges.
FUN MANUAL CALCULATION!!
Using the WidgeOne.xls dataset, determine the number of individuals with job tenure (YRONJOB) in the following
categories:
- Less than 5 years
- 5-10 years
- More than 10 years
Here is how your answer should appear:

Category             Frequency   Relative Frequency   Cumulative Frequency
Less than 5 years     9           22.50%               22.50%
5-10 years           16           40.00%               62.50%
More than 10 years   15           37.50%              100.00%
Total                40          100.00%
It is important to note that the categories are mutually exclusive (no observation can occur in two categories
simultaneously) and collectively exhaustive (every observation is accommodated).
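Binning a quantitative variable into mutually exclusive, collectively exhaustive categories is mechanical enough to sketch in a few lines of Python (hypothetical tenure values, not the WidgeOne data):

```python
# Build a frequency table by binning YRONJOB-style values into
# three mutually exclusive, collectively exhaustive categories.

def categorize(years):
    if years < 5:
        return "Less than 5 years"
    if years <= 10:
        return "5-10 years"
    return "More than 10 years"

tenure = [2, 4, 5, 7, 8, 9, 10, 11, 13, 17]   # hypothetical values
counts = {}
for y in tenure:
    cat = categorize(y)
    counts[cat] = counts.get(cat, 0) + 1

n = len(tenure)
cumulative = 0.0
for cat in ["Less than 5 years", "5-10 years", "More than 10 years"]:
    rel = counts[cat] / n * 100               # relative frequency (%)
    cumulative += rel                         # cumulative frequency (%)
    print(f"{cat:<20} {counts[cat]:>3} {rel:7.2f}% {cumulative:7.2f}%")
```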
This representation of the dispersion of the data is referred to as a frequency table and is the most common and one of the
most useful representations of data.
In this instance, we converted a quantitative variable into a qualitative variable for the purposes of developing a
frequency table. We do this frequently to take a different kind of look at a quantitative variable.
If we had a qualitative variable that we wanted to better understand, we would generate the appropriate measurement of
central tendency (Mode) and the measurement of dispersion (frequencies) through the application of a frequency table.
What you need to know → Measurements of dispersion provide information regarding how spread-out or compact the
data is. Typically this is communicated through the computation of the standard deviation AND some display of the
frequency counts of the observations across specified categories. If the data is qualitative, the only measurement of
dispersion comes from the frequency table.
Concept 3: Visualization of Univariate Data
Typically, data analysis includes BOTH the computational analysis as well as some visual representation of the analysis.
Many recipients of your work will never look at your actual calculations – only your tables and graphs (remember the
reference above to the “great statistical unwashed”?). As a result, visual representation of your analysis should receive
the same amount of attention and dedication as your computational analysis.
Edward Tufte has published several books and articles on the topic of the visualization of data. We recommend his
seminal work The Visual Display of Quantitative Information as an excellent reference on the topic. See
https://www.edwardtufte.com/.
When developing a visual representation of a single variable, the most common tools include – Histograms, Pie Charts,
Bar Charts, Box Plots and Stem and Leaf Plots. Each of these will be discussed briefly in turn.
Histograms Histograms visually communicate the shape, central tendency, and dispersion of a dataset. For this reason,
histograms are heavily used in conjunction with the measurements of central tendency and the measurements of
dispersion to describe a particular variable (as we did while discussing central tendency). Histograms are used with
QUANTITATIVE DATA. For all of the packages discussed below, you can simply reference the quantitative
variable directly and a histogram will be generated.
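Under the hood, a histogram is just frequency counts over equal-width bins. This text-mode Python sketch (hypothetical data, 3-year bins) mimics what the packages draw:

```python
# A histogram is frequency counts over equal-width bins of a
# quantitative variable; here each '*' is one observation.

tenure = [1, 2, 4, 5, 5, 7, 8, 8, 9, 9, 10, 11, 12, 14, 17]
width = 3                            # bin width in years
bins = {}
for x in tenure:
    lo = (x // width) * width        # left edge of this value's bin
    bins[lo] = bins.get(lo, 0) + 1

for lo in sorted(bins):
    print(f"{lo:>2}-{lo + width - 1:<2} {'*' * bins[lo]}")
```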
The following histogram was generated using Minitab:
[Histogram of Widge One Employee Job Tenure: Frequency (0-6) vs. Years on Job (0-18)]
Note in this graphic that the left axis represents the actual frequency counts and the horizontal axis represents the job
tenure of the employees. From this graphic, it is easy to see that the data is (roughly) normally distributed with a mean,
median and mode somewhere around 9 years.
Pie Charts Pie charts can be useful for displaying the relative frequency of observations by category, if used properly.
They can be used to visualize ordinal data, but bar charts are more appropriate because they can show the inherent order.
Consider these two guidelines:
o Use 5 or fewer “slices” – if more than 5 slices are needed, use a table;
o Order the relative frequencies in ascending (or descending) order.
Using the same Job Tenure data, the associated pie chart, generated using Minitab, would look like this:
[Pie chart “Job Tenure of Widge One Employees”: 5 to 10 Years – 40.0%, More than 10 Years – 37.5%, Less than 5 Years – 22.5%]
It should probably be noted at this point that approximately 8% of all men and 0.5% of all women are colorblind.
Although colorblindness comes in many different forms, the most common forms involve the colors red, green, yellow,
and brown. Individuals who are colorblind cannot distinguish among these colors. Therefore, when constructing pie
charts or any other type of colored visual representation of your analysis, avoid placing these colors adjacent to each
other.
Bar Charts Bar Charts ARE NOT Histograms! Bar Charts are intended to represent the frequency counts of
QUALITATIVE data. The plant information from WidgeOne.xls would look like this:
[Bar chart “Bar Chart of Plant Employees”: Count (0-25) by Plant – Dallas, Norcross]
This bar chart was developed using Minitab.
Bar Charts and Pie Charts are the primary tools used to display qualitative data, but keep in mind that, for ordinal data,
bar charts are more appropriate than pie charts. Bar charts are able to illustrate the natural order of the data whereas a pie
chart cannot. When using bar charts as a visual of ordinal data, be sure to display the correct order of the data.
Remember, when constructing graphical displays of nominal data, most software packages will order the values
alphabetically, not in the natural order. Oftentimes you will have to go in and change it (don’t worry – we will show
you how).
Stem and Leaf Plots Stem and leaf plots, like histograms, provide a visual representation of the shape of the data and the
central tendency of the dataset. Here is the stem and leaf plot for the Job Tenure variable:
2 0 01
7 0 22233
12 0 44555
16 0 6777
(8) 0 88888999
16 1 0000111
9 1 2333
5 1 4445
1 1 7
When reading this Minitab-style stem and leaf plot, the leftmost number is a cumulative count of observations from the
nearest end of the data (the parenthesized value marks the row containing the median), the middle number is the “stem”
(the tens digit), and the digits to the right are the “leaves” (the ones digits). For example, the last row of the plot above
has stem 1 and leaf 7, representing an observation with 17 years on the job. The leading 1 in that row indicates that only
one employee lies at or beyond that stem – there is a single employee with 17.x years on the job.
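Producing a stem and leaf display is simple enough to sketch by hand in Python (hypothetical whole-year values; this version shows stems and leaves only, without Minitab’s depth column):

```python
# Stem and leaf: the stem is the tens digit, the leaves are the
# ones digits, sorted within each stem.

tenure = [1, 3, 5, 8, 8, 9, 10, 11, 13, 14, 17]
stems = {}
for x in tenure:
    stems.setdefault(x // 10, []).append(x % 10)

for stem in sorted(stems):
    leaves = "".join(str(d) for d in sorted(stems[stem]))
    print(f"{stem} | {leaves}")
```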
Boxplots The last tool described in this manual for visualizing univariate data is the boxplot. The boxplot builds on the
information displayed in a stem-and-leaf plot, focusing particular attention on the symmetry of the distribution and
incorporating numerical measures of central tendency and location.
Prior to creating a boxplot, you need to be familiar with the concept of quartiles. The boxplot incorporates the median
and the quartiles of a variable. The quartiles of a dataset are the points below which 25%, 50% (the same as the median)
and 75% of the data lie; they are written as Q1, Q2 and Q3, respectively. The distance between Q1 and Q3 is referred to as
the Interquartile Range, or IQR. It spans the center 50% of the dataset.
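As a quick check of these definitions, here is a short Python sketch using the standard library's `statistics` module (the tenure values are hypothetical, chosen so the quartiles land at 5, 8 and 11):

```python
import statistics

# Hypothetical sample of tenure values (years)
tenure = [1, 2, 3, 5, 6, 8, 8, 9, 10, 11, 12, 14, 17]

# "inclusive" treats the data as the whole population of interest
q1, q2, q3 = statistics.quantiles(tenure, n=4, method="inclusive")
iqr = q3 - q1   # the distance between Q1 and Q3

print(f"Q1={q1}, Q2 (median)={q2}, Q3={q3}, IQR={iqr}")
# → Q1=5.0, Q2 (median)=8.0, Q3=11.0, IQR=6.0
```

Different packages (and different `method` choices) interpolate quartiles slightly differently, so small discrepancies between software outputs are normal.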
Below is the boxplot for the Job Tenure variable from WidgeOne.xls.
[Figure: Boxplot of Job Tenure – y-axis: Years on Job (0–18), with the Median line and the IQR box annotated.]
From this boxplot, you can see that Q1 begins at 5, Q2 (also the median) begins at 8 (the actual median of the dataset is
8.35), Q3 begins at 11 and the highest value of the dataset is 17.0. Notice that the distance from the median line to the top
of the IQR box is roughly the same distance as the median line from the bottom of the IQR box. From this, we would
conclude that this dataset is relatively symmetric.
As previously mentioned while discussing central tendency, box plots are an excellent tool to examine the symmetry of
the data and identify potential outliers.
The following graphic is a box plot of data with a right-skewed distribution:
[Figure: Boxplot of a right-skewed Generic Variable (y-axis roughly 20–70).]
You can tell that the distribution is right-skewed because the inner-box distances from the median line are not equal and
the upper vertical line (the whisker) is longer than the lower one.
The following graphic is a boxplot of a left-skewed distribution:
[Figure: Boxplot of a left-skewed Generic Variable (y-axis roughly 40–80).]
The opposite is true for the boxplot above. We can see that the distribution of the generic variable is left-skewed.
What you need to know  Many individuals who are analytically very strong often place insufficient emphasis on
graphics and visual representations of data. Many individuals who are not strong analytically, but who need analysis to
support their decision-making, often place an overemphasis on graphics and visualization. Individuals who can execute
both well will go far. Histograms, Stem and Leaf Plots and Boxplots are used with QUANTITATIVE DATA. Bar Charts,
Pie Charts and Column Charts are used with QUALITATIVE DATA.
Concept 4: Organization/Visualization of Multivariate Data
Frequently, we need to understand and report the relationships between and among variables within a dataset. When
developing visual representations of multiple variables, the most common tools include – Contingency Tables (qualitative
and quantitative data), Stacked Bar Charts (qualitative data), 100% Stacked Bar Charts (qualitative data), and Scatter plots
(quantitative data). Each of these will be discussed briefly in order.
Contingency Tables One of the most common and useful methods of displaying the relationships between two or more
variables is the contingency table. This table is highly versatile and easily constructed. As an example, let’s take the
GENDER and PLANT variables from the WidgeOne.xls dataset. A contingency table of these two variables would look
like this:
Counts of Employees by Gender and Plant

Gender      Dallas   Norcross   Total
Female          13          7      20
Male            10         10      20
Total           23         17      40
This table displays the frequency of the number of females and males at each plant.
We could also display this table as percentages rather than as frequencies. In the following contingency table the
percentages are given as a percentage of each gender (row percentages). Specifically, the interpretation of the first cell
would be “…of all of the female employees, 65% work in Dallas”.
WidgeOne Employees by Gender and Plant (row percentages)

Gender      Dallas    Norcross   Total
Female      65.00%    35.00%     100.00%
Male        50.00%    50.00%     100.00%
Total       57.50%    42.50%     100.00%
The percentages could easily be reversed to represent the percentage of individuals at each plant (column percentages):
WidgeOne Employees by Gender and Plant (column percentages)

Gender      Dallas     Norcross   Total
Female      56.52%     41.18%     50.00%
Male        43.48%     58.82%     50.00%
Total       100.00%    100.00%    100.00%
In this version of the table, the first cell now communicates “…of all of the Dallas employees, 56.52% are female.”
Finally, we can also represent the data as overall percentages:
WidgeOne Employees by Gender and Plant (overall percentages)

Gender      Dallas    Norcross   Total
Female      32.50%    17.50%     50.00%
Male        25.00%    25.00%     50.00%
Total       57.50%    42.50%     100.00%

In this version of the table, the first cell now communicates “…of all employees, 32.50% are females in Dallas”.
Before moving on, please ensure that you fully understand the differences across these three tables. They are subtle, but
important.
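The three percentage views can be reproduced from the raw counts with a few lines of Python, offered here only as a sketch to make the differences concrete (the counts come from the frequency table above):

```python
from collections import Counter

# The 40 WidgeOne employees summarized as (gender, plant) counts
counts = Counter({("Female", "Dallas"): 13, ("Female", "Norcross"): 7,
                  ("Male", "Dallas"): 10, ("Male", "Norcross"): 10})
total = sum(counts.values())

def row_pct(gender, plant):
    """Percent of this gender's employees who work at this plant."""
    row_total = sum(v for (g, _), v in counts.items() if g == gender)
    return 100 * counts[(gender, plant)] / row_total

def col_pct(gender, plant):
    """Percent of this plant's employees who are this gender."""
    col_total = sum(v for (_, p), v in counts.items() if p == plant)
    return 100 * counts[(gender, plant)] / col_total

def overall_pct(gender, plant):
    """Percent of all employees in this gender-plant cell."""
    return 100 * counts[(gender, plant)] / total

print(round(row_pct("Female", "Dallas"), 2))      # → 65.0
print(round(col_pct("Female", "Dallas"), 2))      # → 56.52
print(round(overall_pct("Female", "Dallas"), 2))  # → 32.5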
Both gender and plant are categorical variables. We could incorporate a quantitative variable into this table – such as job
tenure:
Mean Job Tenure of Employees by Gender and Plant

Gender      Dallas   Norcross   Total
Female        8.85       6.94    8.19
Male          7.13       9.66    8.40
Total         8.10       8.54    8.29
This table now provides information about the average job tenure for each gender and each plant, and for each gender at
each plant. For example, the first cell now communicates, “…The females in Dallas have an average job tenure of 8.85 years”.
These contingency tables were created using MS Excel.
Stacked Bar Charts Stacked bars are a convenient way to display percentages or proportions, such as might be done in a
pie chart, for multiple variables. For example, the proportion of each gender at each plant would be displayed like this in
a stacked bar chart:
[Figure: Stacked bar chart “Bar Chart of Gender by Plant” – Gender (Male, Female) stacked within each Plant (Dallas, Norcross); y-axis: Count (0–25).]
This graphic is fine. However, when the group sizes differ – particularly by a lot – stacked bar charts are less
informative, because it is difficult to see how the groups compare. For example, the difference in the number of Dallas
and Norcross employees is not dramatic, but even here it is difficult to discern which plant has a greater proportion of men.
100% Stacked Bar Charts To solve this problem, we can apply a 100% stacked bar chart. This visualization tool simply
calibrates the populations of interest – like the two plants – to both be evaluated out of a total of 100%. You can almost
think of 100% Stacked Bar Charts as side-by-side pie charts.
[Figure: “100% Bar Chart of Gender by Plant” – Gender (Male, Female) stacked to 100% within each Plant (Dallas, Norcross); y-axis: Percent (0–100). Percent within levels of Plant.]
Compare this graphic to the first Stacked Bar Graph. They are different. They communicate subtly different messages.
Scatter Plots What if we wanted to better understand whether there is a meaningful relationship between two
quantitative variables, such as the possible relationship between job tenure and productivity?
This question can be addressed using a scatter plot, where one quantitative variable is plotted on the y-axis and the
second quantitative variable is plotted on the x-axis:
[Figure: Scatter plot “Is Job Tenure Related to Productivity?” – x-axis: Job Tenure (0–20); y-axis: Productivity (70.00–100.00).]
If two variables are related, we would expect to see some pattern within the scatter plot, such as a line. If job tenure and
productivity were “positively” related, we would expect the points to trend upward from the SW corner to the NE corner.
This would indicate that as job tenure goes up, productivity goes up. If job tenure and productivity were “negatively”
related, we would expect the points to trend downward from the NW corner to the SE corner. This would indicate that as
job tenure goes up, productivity goes down.
In this scatter plot, neither of these linear patterns (or any other pattern) is reflected. This “cloud” is referred to as a “Null
Plot”. As a result, we would conclude that job tenure and productivity are not related.
We can derive additional information from this scatter plot. Specifically, we can determine the “best fit” line – in the form
y = mx + b. This is the linear equation that minimizes the sum of squared distances between the predicted values and the
actual values, where y = the predicted value of an employee’s productivity and x = the actual number of years of an
employee’s job tenure: y = -0.5715x + 89.318. This equation generates an “R2” value of 0.1124, which represents the
proportion of the variance of the dependent variable (productivity) that can be explained by the independent variable (job tenure).
Detailed explanations of these concepts are outside of the scope of this document, but are heavily used in Statistics and
form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend
Statistical Methods and Data Analysis by Ott and Longnecker.
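For the curious, the slope, intercept and R2 above can be computed directly from the standard least-squares formulas. Here is a Python sketch using a tiny hypothetical sample rather than the WidgeOne data:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b, plus the R^2 value."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)                      # spread of x
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))    # co-movement of x and y
    m = sxy / sxx
    b = my - m * mx
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot   # proportion of variance in y explained by x
    return m, b, r2

# Tiny hypothetical sample: tenure (years) vs productivity score
m, b, r2 = fit_line([1, 2, 3, 4], [89, 88, 88, 86])
```

Statistical packages report these same quantities; this sketch just exposes the arithmetic.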
What you need to know  Stacked Bar charts are used to display the counts within groupings of qualitative variables.
When those groupings are of different sizes, a 100% Stacked Bar Chart is preferred. You can think of 100% Stacked Bar
Charts as side by side Pie Charts. Scatterplots are used to communicate if a relationship exists between two quantitative
variables.
Side by side Histograms and Box plots Another way to visually examine the relationship between multiple variables is
the side-by-side histogram or box plot. These are similar to their univariate counterparts except that they are separated
by another variable so we can compare them side by side.
Remember histograms are only appropriate for quantitative data, so let’s look at a histogram of employee Job Tenure
again. If we’re going to do side-by-side histograms, they must be grouped by a qualitative variable, like plant location.
The following side-by-side histogram shows job tenure by plant for the Widge One employees:
[Figure: “Histogram of Widge One Employee Job Tenure by Plant” – side-by-side histograms of Years on Job (0–16) paneled by Plant (Dallas, Norcross); y-axis: Frequency (0–5). Panel variable: Plant.]
Now at a glance, we can see that both plants have roughly the same distribution, but the Dallas plant seems to have more
of the less experienced employees than the Norcross plant.
A side-by-side box plot has the same requirements – the box plots should be built from a quantitative variable and
grouped by a qualitative variable. Let’s use the same two variables again, YRONJOB and Plant:
[Figure: “Boxplots of Widge One Employee Job Tenure by Plant” – side-by-side boxplots of Years on Job (0–18) for Dallas and Norcross. Panel variable: Plant.]
Nice! Now we can see that the Dallas plant employees have a larger range of job tenures, and that the median job tenure
at the Norcross plant is larger than the median job tenure at the Dallas plant.
Both the side-by-side histogram and boxplot were generated using Minitab.
Concept 5: Random Number Generation and Simple Random Sampling
The statistical concepts covered up to this point would really fall under the heading of “Data Analysis” or “Basic
Descriptive Statistics”. These concepts enable us to describe or represent a given dataset to other people and are
employed once the data have been gathered. They represent a critical, albeit simple, set of analytical tools. Now let’s take
a step back…what if the data NEEDS to be gathered?
Entire disciplines exist in the areas of experimental design and sampling. Although the scope of this document does not
include an examination of these areas, we will address a foundational concept of these areas – random number generation
to support simple random sampling using statistical software. Humans are woefully deficient in our ability to generate
truly random numbers. In fact, human “random” number generation is so NOT random, that computer programs have
been written that accurately predict the “random” numbers that humans will select.
Randomly generated numbers can be forced to follow a particular probability distribution and/or fall between an
established minimum and maximum value. We will be generating numbers which follow a uniform distribution, where
every number has the same probability of occurrence. This is the most common execution of random number
generation. It should be noted that random numbers could follow any probability distribution (e.g., normal, binomial,
Poisson, etc.).
One of the primary rationales for generating a string of random numbers is to select a sample of observations for analysis.
Often, researchers do not have the time, the access, or the money to analyze every element in a dataset. Assigning a
random number to every element in a dataset and then selecting, for example, the first 50 elements when sorted by that
random number is a statistically valid method of sampling. When a uniform distribution is used to generate these
random numbers, the process is referred to as simple random sampling, where every element has an equal probability of
selection. Simple random sampling using random number generation is a very common technique used by analysts to
select a subset of a population of elements for analysis.
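The assign-a-random-number-and-sort procedure described above can be sketched in a few lines of Python (the population of 500 element IDs is hypothetical; `random.sample` would accomplish the same in one call):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

population = list(range(1, 501))   # hypothetical element IDs 1..500

# Assign a uniform random number to each element, sort by it,
# then keep the first 50 elements -- a simple random sample.
keyed = [(random.random(), elem) for elem in population]
sample = [elem for _, elem in sorted(keyed)[:50]]

print(len(sample), len(set(sample)))  # → 50 50 (50 distinct elements)
```

Because the keys are uniform, every element has the same chance of landing in the first 50 positions after sorting.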
Concept 6: Confidence Intervals
As stated previously, Concepts 1-4 fall under the heading of “Descriptive Statistics”, where the analyst has access to the
entire dataset and is simply providing a “description” or visual representation of the central tendency or the dispersion of
the dataset. Concept 5 – Random Number Generation – is an important tool that analysts use to subset a dataset or assign
elements for survey or additional analysis. When a sample is analyzed for the purposes of better understanding a
population, the process is referred to as “Inferential Statistics”2. Here is a brief comparison of Descriptive Statistics and
Inferential Statistics:
              Descriptive Statistics                Inferential Statistics
Dataset       Population (entire dataset)           Sample from a population
Accuracy      100% accurate (assuming the           Some margin of error will be expected
              calculations were done correctly)
Confidence    100%                                  Typically 90%, 95% or 99%
Example       Measurements of Central Tendency      Confidence intervals around a
                                                    population parameter
Preference?   ALWAYS preferred!                     Never preferred…but accepted as a
                                                    trade-off for cost and/or time
2 Inferential statistics is based on the Central Limit Theorem. Readers are assumed to have a working knowledge of this theorem. For a refresher
on the Central Limit Theorem, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
Concept 6 – Confidence Intervals – therefore is different from the first four concepts reviewed in this manual, because we
are moving from descriptive statistics to inferential statistics.
Simply stated, a confidence interval is an estimation of some unknown population parameter (usually the mean), based
on sample statistics, where the acceptable margin of error and/or confidence level is pre-established.
The formula used to estimate a two-sided confidence interval for a population mean is X̄ ± Z × (Sx / √n), where
X̄ = the sample mean;
Z = the number of standard deviations, using the sampling distribution and the Central Limit Theorem, associated with
the established confidence level:
90% confidence = 1.645
95% confidence = 1.96
99% confidence = 2.575
Sx= the sample standard deviation;
n = the number of elements in the sample.
The formula used to estimate a two-sided confidence interval for a population proportion is p̂ ± Z × √(p̂q̂ / n), where
p̂ = the sample proportion;
q̂ = 1 - p̂;
Z = same as above;
n = same as above.
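The proportion formula can be sketched in Python as follows (the 24-of-40 “satisfied” count is hypothetical, not from the WidgeOne survey):

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Two-sided CI for a population proportion: p_hat +/- z*sqrt(p_hat*q_hat/n)."""
    q_hat = 1 - p_hat
    margin = z * math.sqrt(p_hat * q_hat / n)   # the Margin of Error
    return p_hat - margin, p_hat + margin

# Hypothetical example: 24 of 40 sampled employees answered "satisfied"
low, high = proportion_ci(24 / 40, 40)
```

The interval is symmetric about the sample proportion, so the midpoint of (low, high) is always p̂ itself.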
In both formulas, the expression after the ± sign is referred to as the “Margin of Error”.
FUN MANUAL CALCULATION!!
Let’s assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of
employees in Norcross, GA and Dallas, TX. Let’s also assume that the HR department at WidgeOne has been charged
with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire
organization, so they surveyed the 40 employees in our dataset. Report the job satisfaction for all WidgeOne employees,
using the sample of 40. Use a 95% level of confidence.
From the WidgeOne.xls dataset, the mean Job Satisfaction is 6.85 (where 1=low satisfaction and 10 = high satisfaction) and
the standard deviation is 1.02. Using the formula above, the confidence interval calculation is:
6.85 ± 1.96 × (1.02 / √40)
or 6.85 ± 0.32
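The same calculation in Python, for readers who want to verify the arithmetic:

```python
import math

# The manual calculation above, reproduced in code
mean, sd, n, z = 6.85, 1.02, 40, 1.96

margin = z * sd / math.sqrt(n)          # the Margin of Error
low, high = mean - margin, mean + margin

print(f"95% CI: ({low:.2f}, {high:.2f})")  # → 95% CI: (6.53, 7.17)
```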
If you actually gave this number to most people, they would have no idea what it meant. The proper way to
communicate this information is:
“Based on a representative sample of 40 employees, we are 95% confident that the mean job satisfaction among all
employees is between 6.53 and 7.17”.
Loosely speaking, this means we are 95% confident that the “true” mean job satisfaction of all employees, which is
unknown, falls between 6.53 and 7.17. More formally, if we repeatedly drew samples of 40 and constructed intervals in
this way, about 95% of those intervals would contain the true mean, and about 5% would not.
What you really need to know  When calculating confidence intervals, use a 95% default unless you know something
about the decision maker. If the decision maker is conservative, use a 99% interval. If the decision maker is risk tolerant,
use a 90% interval. To increase confidence and decrease the margin of error at the same time, increase the sample size.
Explanatory and response variables
The main objective of multivariate analysis is to assess the relationship between two or more variables. A common type of
relationship that we examine in statistics is the cause-effect relationship. The variables play two different roles in this
relationship – the explanatory role and response role. The response variable is the outcome of interest that is being
researched. The explanatory variable is hypothesized to explain or influence the response variable. For example, research
studies investigating lung cancer often specify survival status (whether an individual is alive after 20 years) as the
response variable and smoking status (whether an individual used smoking tobacco and, if so, what amount) as the
explanatory variable.
There are specific locations that are traditionally designated for the explanatory and response variables in the analysis
methods we’ve discussed. The following table summarizes the proper locations of these variables for each of these
analyses.
Method of Analysis                Location of Explanatory Variable   Location of Response Variable
Stratified Analysis               1 or more columns                  Rows
Stratified Confidence Intervals   1 or more columns                  Rows
Contingency Table                 Rows                               Columns
Grouped Histogram                 Different panels                   X-axis
Stacked Bar Charts                Bars                               Stacks
Side-by-Side Boxplots             X-axis                             Y-axis
Scatterplot                       X-axis                             Y-axis