Heather Johnston - admn502a2010a01

Week 11: Basic Descriptive Quantitative Data Analysis
November 16, 2010
Lecture Notes Prepared by Heather Johnston
Introduction
Today’s class was in the computer lab, HSDA150. Prof. Tedds introduced the class to
using MS Excel to produce charts, graphs and descriptive statistics.
Note: the PowerPoint slides from Prof. Tedds’s lecture are available on Moodle under
Week 11 and are titled: Intro to Quant Analysis Lecture.
Housekeeping
A form was circulated for students to indicate which 598 they would be critiquing.
Prof. Brady has special office hours next week to meet with students to discuss their
598 critiques. Some students have already done this and everyone is encouraged to do
this.
Prof. Tedds instructed the class to take comments and feedback about writing seriously,
and to take advantage of the resources that are available for students at the university, e.g.
the Writing Centre. Comments from the federal Treasury Board indicate that writing
skills are one of the biggest weaknesses of new public servants. Consider time spent on
improving writing as professional development!
From now on the wikis from weeks 1-10 are common pages and students may make
changes where necessary, and Prof. Brady will also do this.
For last week’s wiki the issues from the previous week were reiterated. Remember these
wikis are for our future use (during 598s!) so they should not assume too much retained
knowledge and each should be a stand-alone document. Wikis should include examples
from class and links to readings.
*The Scoping Review Drafts are due to our clients tomorrow at noon! Don’t forget to cc
Prof. Tedds, Prof. Brady and KIS (or PICS). The goal of these drafts is for them to be the
final versions, so do lots of editing. Good luck everyone!
Workshop on Descriptive Quantitative Analysis with Prof. Tedds
Purpose
To give a brief introduction on how to use MS Excel to produce charts, graphs and
summary statistics to display data in a meaningful way for supporting arguments.
See Manheim for the theory of descriptive quantitative analysis:
http://moodle.uvic.ca/file.php/14825/Week_11/Week_11_Readings/Manheim_Chap_16_Summarizing_one_variable.pdf
http://moodle.uvic.ca/file.php/14825/Week_11/Week_11_Readings/Manheim_Chapter_15_Desc_Data.pdf
*It is not the statistics themselves that lie. People misuse statistics to bolster weak
arguments.
*Prof. Tedds is concerned she is teaching us enough to manipulate the statistics in MS
Excel, but not enough to know how to use statistics properly. More to come in 502B!
Introduction to Quantitative Measures
Without data, it is just opinion: what we think we know is not actually supported by
evidence.
VIDEO Hans Rosling, part 1 – developmental economics, does your mindset correspond
with the dataset?
Ex) developing world vs. western world: how do you define the two? The mindset is that the
developing world has large families and short lives and the western world has small families
and long lives. However, the data show that over time this has changed because of
international development interventions on family planning etc. Now the developing world
lives to 60 years and has families with only 2 children, so the mindset is shown to be
antiquated by the data.
VIDEO Hans Rosling, part 2 – graphing skills to compare datasets, graphs can be
informative and fun!
NOTE – getting buy-in for statistics is easy when it matches what people already know; it
takes more work to convince people their mindset is wrong
How to Design tables, graphs and summary statistics:
They can be used to deceive the naïve. Prof Tedds says “The design is the
communication”.
Figure 1: Prof. Tedds PowerPoint slides, p.3
In the Research Process (Figure 1), we are at the ‘?’. Basic descriptive statistics (502B
goes more in depth) make sense out of the numbers by summarizing the dataset. How we
choose to display the data depends on the level of measurement.
Excel Tutorials
Describing, summarizing and displaying data: Tables
-In MS Excel a frequency distribution table or list is referred to as a pivot table
-they are a convenient way to summarize tabular data
-for nominal or ordinal level data
-contains class groupings, or categories of data collected
Ex) percentage of sales reported to tax authority: start with the categories (the ranges
respondents chose in the survey; this is not a continuous measure, it is ordinal data), then
count the observations in each category (frequency). It might be more informative to express
the frequency as a percentage of the column (depends on data)
*Must Have = title, a source for the data set, explanatory note about what’s not included
in the table
VIDEO – how to create frequency distribution table in MS Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial1_Frequencytable.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
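The same kind of frequency table can be sketched outside Excel. The Python below is only an illustration and is not part of Prof. Tedds’s tutorial: the category labels and survey responses are made up, and the function name frequency_table is hypothetical. It counts the observations in each category and expresses each count as a percentage of the column total.

```python
from collections import Counter


def frequency_table(responses, categories):
    """Return (category, count, percent) rows in the given category order."""
    counts = Counter(responses)
    total = sum(counts[c] for c in categories)
    rows = []
    for category in categories:
        count = counts[category]
        # Percentage of the column total; guard against an empty data set.
        percent = 100.0 * count / total if total else 0.0
        rows.append((category, count, round(percent, 1)))
    return rows


# Hypothetical ordinal categories: share of sales reported to the tax authority.
categories = ["0-25%", "26-50%", "51-75%", "76-99%", "100%"]
# Hypothetical survey responses, one per respondent.
responses = ["100%", "76-99%", "100%", "0-25%", "100%", "51-75%", "76-99%", "100%"]

for category, count, percent in frequency_table(responses, categories):
    print(f"{category:<8}{count:>4}{percent:>8.1f}%")
```

Keeping the categories in an explicit list preserves their ordinal order, including categories that nobody selected.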
-add in another variable as the columns, still at the nominal or ordinal level; a contingency
table is used to show the relationship between two variables, with the dependent variable in
the rows and the independent variable in the columns
-decide how you want to display values in the table: counts are hard to read, percentages
by column are easier to understand
-notice Excel builds step by step, using the same steps to produce increasingly complex
outputs
*Must have = simple name for title, include the major variables, explanatory notes for
what is not included, easy to read, formatted to APA standards
VIDEO – how to create a contingency table in MS Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial2_ContingencyTable.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
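As a rough sketch of what a contingency table with column percentages contains, the Python below cross-tabulates two categorical variables. It is an illustration only: the variables (reporting level in the rows, firm size in the columns), the data, and the function name contingency_table are all assumptions made for this example and do not come from the class data set.

```python
from collections import Counter


def contingency_table(pairs):
    """Cross-tabulate (row_value, column_value) pairs into column percentages."""
    counts = Counter(pairs)
    row_values = sorted({r for r, _ in pairs})
    col_values = sorted({c for _, c in pairs})
    # Total number of observations in each column (independent variable).
    col_totals = {c: sum(counts[(r, c)] for r in row_values) for c in col_values}
    table = {}
    for r in row_values:
        table[r] = {
            c: round(100.0 * counts[(r, c)] / col_totals[c], 1) for c in col_values
        }
    return table


# Hypothetical observations: (reporting level, firm size).
pairs = [
    ("full", "small"), ("partial", "small"), ("partial", "small"), ("partial", "small"),
    ("full", "large"), ("full", "large"), ("full", "large"), ("partial", "large"),
]

table = contingency_table(pairs)
for row_value, cells in table.items():
    print(row_value, cells)
```

Because each column sums to 100%, the columns can be compared directly even when the groups have different numbers of observations.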
Describing, summarizing and displaying data: Graphs using nominal or ordinal data
-using the two tables we just created to make graphs
NOTE – graphs in 598 have mostly been bad; remember they should be clean and simple
-a bar graph is used only for categorical (nominal or ordinal) data
-it draws attention to the frequency of each category; it makes the data pop more than a
table does and is easier to process
-another option is to take each column and make a line graph
VIDEO – how to create graphs in MS Excel from frequency tables (one variable):
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial3_Onevariablegraphs.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
Prof. Tedds said to include label values and marker points and not to include gridlines. She
does not like segmented bar charts. The Economist magazine likes to use descending
vertical bar charts, but these only look good if the bars decrease evenly.
-a pie graph is used to emphasize the proportion of each category, as the entire ‘pie’ is
the total of the observations
-most quantitative researchers do not like pie graphs: think about how much ink is used,
and think about how many people print in black and white and how indistinguishable
different colours would be, so use gray scale differentiation. You could also separate out
the pieces, ex) pull out the category you will be highlighting in the text – 100% reporting
to tax authority
VIDEO – how to create graphs from contingency tables (two variables)
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial4_Twovariablegraphs.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
NOTE – APA requires the legend to be within the graph; however, if the type of graph does
not allow for that, the white space on the top right-hand side is preferred
VIDEO – how to create a line/time series graph was not watched in class, the steps are
similar, watch at your own convenience
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial5_timeseriesgraph.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
Describing, summarizing and displaying data: Graphs using continuous data
-a histogram takes a continuous series and groups it into categories so that it looks like an
ordinal (categorical) table; it differs from a frequency table because the underlying data
set is continuous
-categories (aka bins in MS Excel) can be defined by formulas or a priori; modify the
rough bin width to be more sensible and logical
-information is lost by transforming the data to a lower level of measurement, which
diminishes the quality of the data, but it is a quick, rough view of the distribution of the data
VIDEO – how to create a histogram in MS Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial6_Histogram.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
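The binning step behind a histogram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Excel procedure from the video: the income values and bin edges are made up, the function name histogram_counts is hypothetical, and the rule used here (each bin includes its lower edge, and only the last bin includes its upper edge) is one common convention; other tools may treat the edges differently.

```python
def histogram_counts(values, bin_edges):
    """Count values per bin: [lower, upper), with the last bin closed on the right."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lower, upper = bin_edges[i], bin_edges[i + 1]
            if lower <= v < upper or (i == last and v == upper):
                counts[i] += 1
                break
        # Values outside the overall range of the bin edges are not counted.
    return counts


# Hypothetical continuous data: incomes in thousands of dollars.
incomes = [12, 18, 25, 31, 38, 42, 47, 55, 63, 90]
# Bins chosen a priori with a "sensible and logical" width of 20.
edges = [0, 20, 40, 60, 80, 100]

for i, count in enumerate(histogram_counts(incomes, edges)):
    print(f"{edges[i]:>3}-{edges[i + 1]:<3} {'#' * count}")
```

The output shows only how many observations fall in each bin, which is exactly the loss of information the notes describe: the individual values can no longer be recovered.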
-a scatter graph is a visual representation of the relationship between two continuous
variables, with one point plotted for each pair of observations; an upward-sloping pattern
of points shows a positive relationship between the variables (moderate if the points are
only loosely clustered around a line)
VIDEO – how to create a scatter graph in MS Excel, not watched in class, watch on your
own!
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial7_Scatter.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
How to Present Graphs
-balance substance and design
-proportion and balance, simplicity of design and complexity of data
-clear and efficient, make sure it shows what you want
-efficiency means least ink and smallest space
-tell the truth, have a story and make sure your graph tells it
-avoid chart junk i.e. a bunch of labels, crazy vibrations and extra graphics, keep
graphs simple, plain and white
Displaying Data: “Mistakes” (or “Deceptions”)
-non-zero origin can exaggerate the change ex) Gordon Campbell starting the tax
graph from 1000 made the difference between BC and Ontario look like 2/3 but it
is actually only ½
-limiting scope ex) not examining other historic recessions when looking at this
recession
-omitting data that refutes your point ex) not reporting totals/capita
*Always be cautious when looking at graphs, ask yourself ‘what are they not telling me?’
Q-In the 598 a student is critiquing, there is very little variation in the data, so it is hard to
see the differences in the bar graph. How could that be improved?
A-Perhaps use a different style of graph, or give a non-manipulated graph and then provide
another zoomed-in graph with an explanatory note. Always be explicit about omissions and
limited scopes.
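The non-zero origin effect is simple arithmetic, sketched below in Python. The two bar values are hypothetical, chosen only so that the result reproduces the ½ versus 2/3 contrast mentioned above; they are not the figures from the actual tax graph, and the function name apparent_gap is made up for this example.

```python
def apparent_gap(high, low, axis_origin=0.0):
    """Fraction by which the shorter bar looks smaller than the taller bar."""
    # Only the part of each bar above the axis origin is drawn.
    visible_high = high - axis_origin
    visible_low = low - axis_origin
    return (visible_high - visible_low) / visible_high


# Hypothetical values for the two bars.
high, low = 4000.0, 2000.0

true_gap = apparent_gap(high, low)                            # axis starts at 0
truncated_gap = apparent_gap(high, low, axis_origin=1000.0)   # axis starts at 1000

print(f"gap with zero origin:      {true_gap:.2f}")
print(f"gap with origin at 1000:   {truncated_gap:.2f}")
```

The data are identical in both cases; only the drawn height of the bars changes, which is why a truncated axis should always be flagged in an explanatory note.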
Describing, summarizing and displaying data: Numerical
Central Tendencies
-different central tendency measures are applicable depending on the continuous data
-Mode is the value that occurs most often in the dataset; there can be multiple modes or no
mode.
-Median is the point in the data where 50% of the data falls below it and 50% of
the data is above it.
-Mean is the simple arithmetic mean; it is the balancing point of the data (where
the fulcrum on the scale is placed). The simplified explanation of the formula is: it
is the sum of all observations divided by the number of observations.
o In Excel, AVERAGE is the function for the mean
These central tendencies can indicate what the shape of the distribution of data looks like
without creating a histogram:
-if mean = median the distribution is symmetric (bimodal or unimodal)
-if there is one mode, it is unimodal
-a uniform distribution is not very common (mean = median, but no mode)
-if the mean, median and mode differ, it is a skewed distribution:
o positive skew (mode<median<mean): tail to the right, more heavily distributed at the
start ex) after tax income in Canada
o negative skew (mean<median<mode): tail to the left, more heavily distributed at the
high end of the range ex) University grades because of grade inflation
o median is more representative than mean if the distribution is skewed because
median is not affected by outliers in the distribution
VIDEO – how to calculate central tendencies in Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial8_centraltendency.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
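For comparison with the Excel steps, the three measures can be computed with Python’s standard statistics module. This is a minimal sketch: the data are hypothetical, the function name describe_centre is made up, and comparing the mean with the median is only a rule of thumb for the direction of skew, not a formal test.

```python
import statistics


def describe_centre(values):
    """Return mean, median, list of modes and a rough guess at the skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    modes = statistics.multimode(values)  # every most-frequent value
    # Rule of thumb only: outliers pull the mean toward the long tail.
    if mean > median:
        shape = "positive skew (tail to the right)"
    elif mean < median:
        shape = "negative skew (tail to the left)"
    else:
        shape = "roughly symmetric"
    return mean, median, modes, shape


# Hypothetical after-tax incomes in thousands, with one very high earner.
incomes = [20, 25, 25, 30, 35, 40, 120]
mean, median, modes, shape = describe_centre(incomes)
print(f"mean={mean:.1f} median={median} modes={modes} -> {shape}")
```

The single value of 120 pulls the mean well above the median, while the median stays at the middle observation, which is why the median is the more representative measure for skewed data.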
Dispersion
(Joke from Prof. Tedds) – It was funny, you had to be there! It illustrated that the average
doesn’t tell us much without a measure of dispersion
-Range – the simplest measure of dispersion (largest value – smallest value); it ignores all
other data points and can be sensitive to outliers
-Variance – a single summary measure of dispersion that accounts for all data points;
however, the result is expressed in squared units (units²)
-Standard deviation – the square root of the variance; taking the square root puts the
measure back into the units the data were reported in. A large standard deviation means the
values are spread far from the mean, i.e. a more spread out distribution
o more to come on standard deviations in 502B!
VIDEO – how to calculate measures of dispersion in MS Excel. There is a shortcut under the
‘Data Analysis’ tool – Descriptive Statistics
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial9_Dispersion.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
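To show why an average needs a measure of dispersion beside it, the Python sketch below computes the three measures for two made-up data sets that share the same mean. The data and the function name dispersion are hypothetical, and the sample version of the variance (dividing by n - 1) is an assumption made for this example.

```python
import math


def dispersion(values):
    """Return (range, sample variance, sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    value_range = max(values) - min(values)
    # Sample variance: squared deviations from the mean, divided by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    # The square root returns the measure to the original units of the data.
    std_dev = math.sqrt(variance)
    return value_range, variance, std_dev


# Two hypothetical data sets, both with a mean of 50.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("spread", spread)):
    value_range, variance, std_dev = dispersion(data)
    print(f"{name:<7} range={value_range} variance={variance:.1f} sd={std_dev:.2f}")
```

Both data sets report the same average, but the range, variance and standard deviation show that the second is far more spread out.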
Measures of Association
-rather than relying only on a scatter diagram, we can calculate the covariance, which shows
the direction of the linear relationship between two variables; however, its units are
meaningless, so its size is hard to interpret, and no causal effect is implied
-a much better measure is the sample correlation coefficient, which shows the strength of the
association between variables
-the result is bounded between -1 and 1 and it is a unitless measure: closer to -1 means a
stronger negative linear relationship between the variables, closer to 1 means a stronger
positive linear relationship, and a perfect association means all points fall on the line; if
there is no relationship, no sloped line can be drawn through the points
Ex) a correlation coefficient of 0.6 represents a moderately strong positive association
between income and food expenditure: as income increases, food expenditure tends to increase.
VIDEO – how to calculate the correlation coefficient in Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial10_Correlation.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
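The difference between covariance and the correlation coefficient can be seen in a short Python sketch. Everything here is illustrative: the income and food expenditure figures are invented (they are not the data behind the 0.6 example), the function names are hypothetical, and the sample versions of the formulas (dividing by n - 1) are an assumption.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Sample correlation coefficient: covariance scaled by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))  # covariance of x with itself is its variance
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical data: income and food expenditure, both in thousands of dollars.
income = [30, 40, 50, 60, 70]
food = [8, 9, 12, 11, 15]
# The same incomes expressed in dollars instead of thousands.
income_dollars = [x * 1000 for x in income]

print("covariance (thousands):", covariance(income, food))
print("covariance (dollars):  ", covariance(income_dollars, food))
print("correlation (thousands):", round(correlation(income, food), 3))
print("correlation (dollars):  ", round(correlation(income_dollars, food), 3))
```

Changing the units changes the covariance by a factor of 1000 but leaves the correlation coefficient untouched, which is why the unitless coefficient is the easier measure to interpret.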
Concluding Remarks
-The SR goes to the client, cc:ing Prof. Tedds and KIS, and is also delivered to Turnitin
-No class next week, so work on your 598 critique! Due in class and to Turnitin
November 30, last class!
-Final SR requirements: it must incorporate and address all comments from Profs,
KIS/PICS/client. It must include a document that explains all the comments and
how they have been addressed, or why they were not. Groups will get a zero if no
summary document is provided. Be clear and detailed; clients want this
information as well. Professor and KIS will go through our client’s comments
together and will provide us with an amalgamation of consistent comments from
everyone.