
EDL 726: Introduction To Statistical Concepts and Visual Display
Section 1: Introduction to SPSS and the Generation of Visual Strategies to Convey Statistical Data
For the next three weekends, you will receive several step-by-step tutorials that will guide you through
specific types of analyses and help you understand how to produce, analyze, and display statistics in a
meaningful and informative way. The section for this week contains two parts. The first part is a basic
introductory tour of SPSS: the start-up procedure, descriptions and explanations of the
windows you will encounter while using SPSS, and data entry. Once this preliminary lesson is completed, I will
guide students through the procedures involved in generating the most common visual representations of data.
Part I: To Begin SPSS
• Step #1: Activate the program by double-clicking on the SPSS icon. If you have a previous version of SPSS
(PASW), the screens will be very similar to the ones used in this tutorial.
• Once SPSS has started, it may go directly to what looks like a spreadsheet, or you may encounter a
beginning screen that asks you to check an option concerning your intentions. We will be choosing
the “Type in data” option if this beginning screen appears.
This will open a window that looks very similar to an Excel spreadsheet.
• Step #2: If the beginning screen appears, select “Type in data” and click “OK”.
• A blank spreadsheet screen will appear (see figure).
• SPSS presents your data in two views, designated Data View and Variable View (see blue arrows in
figure). You can switch between these two views by clicking on the tabs in the lower left (PC) or
middle (Mac).
Variable View: This is where the variables are entered and defined
Data View: This is where the raw data are entered
• Step #3: Click on the “Variable View” tab.
• This will take you to the screen that allows you to enter descriptive information about the data to be
entered.
• Near the top of the page you will see a number of column headings. We are only going to concern
ourselves with the columns labeled “Name”, “Label”, “Values”, and “Measure” at this time.
Name: The name you assign to a variable
Label: A description of the variable (optional)
Values: This will be discussed later
Measure: This indicates the type of measurement scale the data are from
• All other column headings should be left at their defaults (the exception is that I
typically have the “Decimals” column set to 2 places, which may or may not be the
default).
• Step #4: Data Entry. Believe it or not, that concludes our SPSS start-up tutorial. We will next focus on entering
our first data set. Go back to the spreadsheet view by clicking the “Data View” tab and click on the first empty
cell in the upper left.
• Start entering data from the top down as shown in the figure below. I will be using a sample data set
of student test scores from a mathematics placement exam (see figure).
Notice that when you start typing in the data, the column heading changes to “VAR0001”. We will rename this
variable in the next step.
• Step #5: Once the data are entered as shown in Step #4, click on the “Variable View” tab. This will allow us to
define the properties of our variable.
• Once you enter the Variable View window, you will notice that the top row has been filled in
automatically and the default variable name VAR0001 has been entered. We are going to change
that to the more descriptive Math_Placement.
• Click in the cell containing the name VAR0001 and change it to Math_Placement. Be careful, because
sometimes this step is a little trickier than it should be. SPSS does not allow spaces in variable names,
and a name cannot begin with a number.
• All other values should look like the figure below. Typically, the only other thing that you will need to
change is the “Measure”, which should be set to the type of data that is recorded. In this case, it is
scale data because math placement scores are treated as continuous data. Click in the box to change the
measurement scale.
• For this demonstration we do not need to worry about the “Label” column or the “Values” column, as
Math_Placement is descriptive enough, and since we only have one variable (and only one
measurement per student) the “Values” column does not apply here.
• Step #6: Click on the “Data View” tab to return to the spreadsheet view. You are now ready to
generate some informative graphs/charts that will aid in the description of your data. Notice that the column
heading in the spreadsheet view reflects the name you assigned in the previous step.
Part II: Generating Various Graphical Output from a Data Set
After visually scanning our data set, there are certain things that we can conclude. For example, we can easily see the
highest and lowest scores, and we can also see that most values are in the 120s. But if this data set
were much larger, as data sets typically are, it would be hard to draw any meaningful inferences based on visual
inspection alone. Because of this, we need to summarize the data and explain the data set in a way that can be easily
interpreted by others. There are essentially two ways to do this: through the use of summary statistics and by visual
representation. Both of these strategies attempt to break the data down into a form that is easy to understand.
One of the first steps done in any analysis is to generate the basic statistics including the measures of central tendency
(mean, median and mode) as well as the standard deviation (more about the meaning of this later). Other statistics
that are of use include the range, interquartile range, frequencies, etc. We will not be looking at all of the output that
SPSS can generate but instead will be focusing on a few important numbers that will appear frequently throughout the
semester.
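As a rough cross-check outside SPSS, the measures of central tendency and the range named above can be sketched with Python's standard statistics module. The 20 scores used here are hypothetical stand-ins (the tutorial's actual values appear only in the figure); they were chosen simply to run from 118 to 125.

```python
import statistics

# Hypothetical placement scores (NOT the tutorial's data set):
# 20 values running from 118 to 125.
scores = [118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
          122, 122, 122, 123, 123, 123, 124, 124, 125, 125]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value (average of the two middle scores)
mode = statistics.mode(scores)            # most frequently occurring score
value_range = max(scores) - min(scores)   # distance from smallest to largest score

print(f"N      = {len(scores)}")
print(f"Mean   = {mean:.2f}")
print(f"Median = {median}")
print(f"Mode   = {mode}")
print(f"Range  = {value_range}")
```

Comparing these numbers with the SPSS output for the same data is a simple way to confirm that the data were typed in correctly.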
Let’s start by opening SPSS and entering the same data that we used in our example: math placement
scores for 20 students. If you still have that data set open, that’s great. If not, follow the directions already given
until your data are entered and the variable has been named Math_Placement, then follow
these steps:
• Step #1: Assuming that you have finished entering the data for the 20 students, choose “Analyze” from
the main menu, then “Descriptive Statistics”, and then “Frequencies” from the fly-out menu.
• You will now be presented with a dialogue box similar to the one shown on the left.
• Click on the Math_Placement variable in the window on the left and click on the little blue arrow. This will
move the variable Math_Placement over to the “Variable(s)” window, which lets SPSS know
which variables you are interested in summarizing. The following figure shows what this dialogue box
should look like.
• Step #2: It is now time to instruct SPSS to compute the specific statistics that we are interested in. We will do
this by clicking on the “Statistics” button. This will bring up another dialogue box with a number of descriptive
statistical options.
• For this exercise, make sure to check the boxes indicated in the figure below. After the appropriate
descriptive statistics are checked, click the “Continue” button and you will return to the previous
dialogue box.
• Step #3: Once you have returned to the initial dialogue box, click on the “Charts” button. Again, you will be
presented with another dialogue box.
• For this exercise (and for statistical reasons) make sure that the chart type you want to display is a
histogram. Histograms are for continuous data; bar charts and pie charts are for discrete data
(nominal data). Since math placement scores are treated as a continuous variable, a histogram is the appropriate choice.
• Once you have chosen the appropriate chart, click the “Continue” button and you will return to the
original dialogue box. Simply click on the “OK” button and SPSS will perform the requested
operations.
• Step #4: SPSS includes two separate windows. One window includes the Data View as well as the Variable View.
The other window becomes important once you have SPSS perform an analysis. This window is called the
output window and contains all of your output (or analyses). Access your output window (called
“output 1”) through the “Window” drop-down menu.
• You should have three pieces of output: 2 tables and a histogram. The output should look like this:
Congratulations, you have successfully entered data into SPSS and instructed SPSS to run specific descriptive
statistics as well as an appropriate visual representation of the data. Almost all operations in SPSS follow
these same concepts: lots of dialogue boxes and lots of choices. However, there is another very important
task to perform, that of interpretation.
We will now look at the individual tables as well as the individual values within each table and describe what
they mean. Once a person understands the steps involved in running an analysis on SPSS it becomes a
simple operation, but many people, even professionals, have trouble understanding the importance and
relevance of output in its totality. I will wait to explain some of the terms until we have a bit more
background in statistics, but we can begin with some of the more basic terms. These descriptive statistics
must be understood in order to interpret some of the statistical analyses we will be conducting later.
Our first table contains most of the descriptive statistics that we are interested in. In SPSS there are a
number of ways to get the same results, and I have shown you only one of them (another way is to go to
Analyze, then Descriptive Statistics, and follow the options there). I encourage all students to “play around” with other
methods of producing output. Let’s take a look at the first table in some detail:
N: This indicates the number of scores recorded.
Mean: The mean is the average of all the scores. Even though we calculated all
three measures of central tendency, the mean is the appropriate measure for continuous data.
Std. Deviation and Variance: Standard deviation and variance are measures of
dispersion and will be discussed later. If we take the square root of the variance, we get the standard
deviation.
Range: The range is simply the distance from the smallest score to the largest.
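The relationship between variance and standard deviation can be checked with a short Python sketch. The scores are hypothetical stand-ins for the figure's data, and the sample (n - 1) formulas are used on the assumption that these are what a descriptive statistics table reports.

```python
import math
import statistics

# Hypothetical placement scores (NOT the tutorial's data set).
scores = [118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
          122, 122, 122, 123, 123, 123, 124, 124, 125, 125]

n = len(scores)                          # number of scores (N in the table)
variance = statistics.variance(scores)   # sample variance, divides by n - 1
std_dev = statistics.stdev(scores)       # sample standard deviation

print(f"N                 = {n}")
print(f"Variance          = {variance:.3f}")
print(f"Std. deviation    = {std_dev:.3f}")
print(f"sqrt(variance)    = {math.sqrt(variance):.3f}")
```

The last two printed values agree, which is the point made in the table notes: the standard deviation is the square root of the variance.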
The second table of data that was produced is essentially a frequency distribution table. This output is
especially valuable when summarizing data in terms of individual scores:
Frequency: The number of times each individual score occurred
Percent: The frequency divided by the total number of scores, f/n. For the
score with a value of 118, the equation would be 1/20 = .05, or 5%
Cumulative frequency: The running total of the frequencies, starting from the smallest score
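The three columns described above can be reproduced with a minimal Python sketch. The scores are hypothetical stand-ins for the figure's data; the only detail taken from the text is that the score 118 occurs once among 20 scores, so its percent is 1/20 = 5%.

```python
from collections import Counter

# Hypothetical placement scores (NOT the tutorial's data set).
scores = [118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
          122, 122, 122, 123, 123, 123, 124, 124, 125, 125]

n = len(scores)
counts = Counter(scores)   # frequency f of each distinct score

rows = []
running_total = 0
for value in sorted(counts):          # start from the smallest score
    f = counts[value]
    running_total += f                # cumulative frequency
    percent = 100 * f / n             # f/n expressed as a percentage
    rows.append((value, f, percent, running_total))

print(f"{'Score':>5} {'Freq':>5} {'Percent':>8} {'Cum. freq':>10}")
for value, f, percent, cumulative in rows:
    print(f"{value:>5} {f:>5} {percent:>7.1f}% {cumulative:>10}")
```

The final cumulative frequency always equals the total number of scores, which is a quick sanity check on any frequency table.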
The last couple of outputs we will talk about are the box plot and the stem-and-leaf plot. The box plot is
really an extension of the percentiles that were listed in the descriptives table (I left out the description of
those because they are more useful when looking at a box plot). In order to produce a box plot as well as a stem-and-leaf
plot, follow the steps below:
• Step #1: Choose “Analyze” from the main drop-down menu and choose “Explore” from the fly-out menu.
• You will again be presented with the dialogue box on the left (it looks a lot like some of the other
dialogue boxes). Make sure that the “Both” button is selected under the “Display”
heading. Then, as before, select the variable that you are interested in and
move it over to the “Dependent List” window using the associated arrow.
• Step #2: From the dialogue box on the right click “Plots”. This will bring up a dialogue box that will allow
you to choose which plots are appropriate for your data.
• At this point we will not be concerned with the “Statistics” button in the dialogue box on the right.
We will just accept the defaults (in actuality, the default setting will produce many of the same
descriptive statistics that we already have).
• Make sure that the “Factor levels together” radio button and the “Stem-and-leaf”
checkbox are both selected. The “Normality plots with tests” checkbox may be selected by default. That is
ok, but we will address that output at a later time.
• Click on “Continue” and when you get to the previous dialogue box, click on “OK”. SPSS will now
produce the chosen output.
You should now access the output window as we have done before. SPSS appends all analyses at the end of
the file, so you may need to scroll down quite a bit to find your requested output. Don’t worry if there seems
to be a lot more output than you expected. At this point we will only be reviewing the box-plot and the
stem-and-leaf output.
Scan through your SPSS output and you should come across the following plot. This is called a box-plot and
really consists of 5 parts.
This line, often called the upper whisker, represents
the largest value in our data set (Math_Placement largest value
= 125).
This line represents the 75th percentile of scores. This
value should match the 75th percentile score shown in
our descriptive statistics table.
This heavy line represents the median, or 50th
percentile score, and should match both the median
and the 50th percentile score shown in the descriptive statistics table.
This line represents the 25th percentile score
and should match the 25th percentile score shown in
the descriptive statistics table.
This line, often called the lower whisker, represents
the lowest value in our data set (Math_Placement lowest value
= 118).
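The five parts of the box-plot correspond to five numbers that can be sketched in Python. The scores are hypothetical stand-ins for the figure's data, and the percentile rule used here (the "exclusive" method of statistics.quantiles) is an assumption; SPSS may interpolate slightly differently, so the 25th and 75th percentiles it reports can differ a little from these.

```python
import statistics

# Hypothetical placement scores (NOT the tutorial's data set).
scores = [118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
          122, 122, 122, 123, 123, 123, 124, 124, 125, 125]

# Quartiles: 25th, 50th (median), and 75th percentiles.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="exclusive")

five_numbers = {
    "lower whisker (minimum)": min(scores),
    "25th percentile": q1,
    "median (50th percentile)": q2,
    "75th percentile": q3,
    "upper whisker (maximum)": max(scores),
}

for label, value in five_numbers.items():
    print(f"{label:<26} {value}")
```

With no outliers in the data, the whiskers sit at the minimum and maximum, and the box spans the 25th to 75th percentiles.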
There is a lot of information that can be understood looking at a box-plot. For example, we know what the
range of our data is simply by looking at the high and low whiskers. We can see how the middle 50% of the
scores are clustered by the size of the box. We can also get an idea of the shape of the distribution by
examining the location of the middle 50% of the scores relative to the whiskers. For example, if the box is
directly in the middle of the whiskers and the heavy line (median) is in the center of the box, the distribution
would be perfectly symmetrical. If the entire box was pushed toward one of the whiskers, then we would
have a skewed (asymmetrical) distribution.
To some extent we can see this distribution by examining our last output, the stem-and-leaf plot. The stem-and-leaf plot simply converts our frequency distribution table into stems (the values obtained for
Math_Placement) and leaves (the frequencies of each value, indicated by “0”s). The result is a kind of
rudimentary depiction of our distribution. Note how this relates to the box-plot and to the histogram
produced earlier.
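The stem-and-leaf idea described above can be imitated in a few lines of Python. This is only a sketch of the layout the text describes (one stem per score, one “0” per occurrence), using hypothetical scores; it is not meant to reproduce the exact formatting of the SPSS output.

```python
from collections import Counter

# Hypothetical placement scores (NOT the tutorial's data set).
scores = [118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
          122, 122, 122, 123, 123, 123, 124, 124, 125, 125]

counts = Counter(scores)

# One line per distinct score: frequency, stem (the score), then one "0" leaf per occurrence.
lines = [f"{counts[value]:>4}  {value} . {'0' * counts[value]}"
         for value in sorted(counts)]

print("Freq  Stem & Leaf")
print("\n".join(lines))
```

Turned on its side, the rows of “0”s trace the same shape as the histogram.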
There is an additional concept to the box-plot that should be addressed here: extreme
scores, or outliers. In our example of Math_Placement scores, there were no outliers, and therefore our box-plot was very simple, indicating the range of scores with the upper and lower whiskers. If the data set did
include some extreme scores, the output would be a bit different. The whiskers would then extend only to the most extreme
scores lying within 1.5 box lengths (1.5 times the interquartile range) of the edges of the box, and the scores in that span are considered “normal” values. The other
values, the scores lying outside the whiskers, would be plotted as points. In order to graphically
represent this in a simple way, I will add 6 scores to our data set that would be considered extreme scores.
The figure below represents our new data set, including the original 20 scores plus the 6 extreme scores.
Notice that my added scores are now outside the whiskers and are shown as points. This is done so the
researcher can flag these to see if they are truly outliers and if they should be eliminated from the analysis.
Notice (if you can see it) that the whiskers end at the same values as before, so the range they cover is
unchanged. However, the location of the box as well as the heavy line indicating the median (50th percentile
score) have changed because the values that I added are still taken into account when determining the
percentile rank of all scores.
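The effect of adding extreme scores can be sketched in Python. Both the 20 scores and the 6 added extreme scores are hypothetical, and the cutoff used to flag points is the common 1.5 × interquartile range convention for box-plots, which is an assumption of this sketch; the quartile rule may also differ slightly from the one SPSS applies.

```python
import statistics

# Hypothetical placement scores and hypothetical extreme scores
# (NOT the tutorial's data set).
scores = [118, 119, 119, 120, 120, 120, 121, 121, 121, 121,
          122, 122, 122, 123, 123, 123, 124, 124, 125, 125]
extremes = [100, 102, 138, 140, 142, 145]

data = sorted(scores + extremes)

q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1                      # length of the box
low_fence = q1 - 1.5 * iqr         # assumed cutoff below the box
high_fence = q3 + 1.5 * iqr        # assumed cutoff above the box

inside = [x for x in data if low_fence <= x <= high_fence]
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers stop at the most extreme scores that are not flagged as outliers.
whisker_low, whisker_high = min(inside), max(inside)

print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Box: {q1} to {q3}, median {median}")
print(f"Points plotted beyond the whiskers: {outliers}")
```

In this sketch the whiskers still end at 118 and 125, while the median and the upper edge of the box move, because the added scores still count when the percentiles are computed.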