Week 11: Basic Descriptive Quantitative Data Analysis
November 16, 2010
Lecture Notes Prepared by Heather Johnston

Introduction
Today’s class was held in the computer lab, HSDA150. Prof. Tedds introduced the class to using MS Excel to produce charts, graphs and descriptive statistics.
Note: the PowerPoint slides from Prof. Tedds’ lecture are available on Moodle under Week 11 and are titled: Intro to Quant Analysis Lecture.

Housekeeping
A form was circulated for students to indicate which 598 they would be critiquing.
Prof. Brady has special office hours next week to meet with students to discuss their 598 critiques. Some students have already done this and everyone is encouraged to do so.
Prof. Tedds instructed the class to take comments and feedback about writing seriously, and to take advantage of the resources available to students at the university, e.g. the Writing Centre. Comments from the federal Treasury Board indicate that writing skills are one of the biggest weaknesses of new public servants. Consider time spent on improving writing as professional development!
From now on the wikis from weeks 1-10 are common pages; students may make changes where necessary, and Prof. Brady will do so as well.
For last week’s wiki the issues from the previous week were reiterated. Remember these wikis are for our future use (during 598s!), so they should not assume too much retained knowledge and each should be a stand-alone document. Wikis should include examples from class and links to readings.
* The Scoping Review Drafts are due to our clients tomorrow at noon! Don’t forget to cc Prof. Tedds, Prof. Brady and KIS (or PICS). The goal is for these drafts to be the final versions, so do lots of editing. Good luck everyone!

Workshop on Descriptive Quantitative Analysis with Prof. Tedds
Purpose
To give a brief introduction on how to use MS Excel to produce charts, graphs and summary statistics that display data in a meaningful way for supporting arguments.
See Manheim for the theory of descriptive quantitative analysis:
http://moodle.uvic.ca/file.php/14825/Week_11/Week_11_Readings/Manheim_Chap_16_Summarizing_one_variable.pdf
http://moodle.uvic.ca/file.php/14825/Week_11/Week_11_Readings/Manheim_Chapter_15_Desc_Data.pdf
*It is not the statistics themselves that lie. People misuse statistics to bolster weak arguments.
*Prof. Tedds is concerned that she is teaching us enough to manipulate the statistics in MS Excel, but not enough to know how to use statistics properly. More to come in 502B!

Introduction to Quantitative Measures
Without data, it’s just opinion: what we think we know is not actually supported by evidence.
VIDEO Hans Rosling, part 1 – developmental economics: does your mindset correspond with the dataset?
Ex) developing world vs. western world: how do you define the two? The mindset is that the developing world has large families and short lives and the western world has small families and long lives. However, the data show that this has changed over time because of international development interventions on family planning etc. People in the developing world now live to 60 years and have families with only 2 children, so the mindset is shown to be antiquated by the data.
VIDEO Hans Rosling, part 2 – graphing skills to compare datasets: graphs can be informative and fun!
NOTE – getting buy-in for statistics is easy when it matches what people already know; it takes more work to convince people their mindset is wrong

How to Design tables, graphs and summary statistics:
They can be used to deceive the naïve. Prof. Tedds says “The design is the communication”.
Figure 1: Prof. Tedds PowerPoint slides, p.3
In Research Process – on Figure 1, we are at the ‘?’.
Basic descriptive statistics (502B goes more in depth) make sense of the numbers by summarizing the dataset. How we choose to display the data depends on the level of measurement.

Excel Tutorials
Describing, summarizing and displaying data: Tables
-In MS Excel a frequency distribution table or list is created as a pivot table
-they are a convenient way to summarize tabular data
-for nominal or ordinal level data
-contains class groupings, or categories of data collected
Ex) percentage of sales reported to tax authority: start with the categories (these are the survey response options, not a continuous measure, so this is ordinal data), then count the observations in each category (frequency). It may be more informative to express frequency as a percentage of the column (depends on data)
*Must Have = title, source for the data set, explanatory note about what’s not included in the table
VIDEO – how to create frequency distribution table in MS Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial1_Frequencytable.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
-add in another column variable, still at the nominal or ordinal level: a contingency table is used to show the relationship between two variables, row (dependent variable) and column (independent variable).
-how do you want to display values in the table? Counts are hard to read; percentages by column are easier to understand
-notice Excel builds step by step, using the same steps to produce increasingly complex outputs
*Must have = simple name for title, include the major variables, explanatory notes for what is not included, easy to read, formatted to APA standards
VIDEO – how to create a contingency table in MS Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial2_ContingencyTable.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/

Describing, summarizing and displaying data: Graphs using nominal or ordinal data
-using the two tables we just created to make graphs
NOTE – graphs in 598s have mostly been bad; remember they should be clean and simple
-a bar graph is used only for categorical (nominal or ordinal) data
-it draws attention to the frequency of categories; it makes them pop more than a table does, so the information is easier to process
-another option is to take each column and make a line graph
VIDEO – how to create graphs in MS Excel from frequency tables (one variable):
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial3_Onevariablegraphs.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
Prof. Tedds said to include label values and marker points and not to include gridlines. She does not like segmented bar charts. The Economist magazine likes to use descending vertical bar charts, but they only look good if the bars decrease evenly
-a pie graph is used to emphasize the proportion of each category, as the entire ‘pie’ is the total of the observations
-most quantitative researchers do not like pie graphs: think about how much ink is used, and about how many people print in black and white and how indistinguishable different colours would be, so use gray scale differentiation. You could separate out the pieces, for ex) pull out the category you will be highlighting in the text – 100% reporting to tax authority
VIDEO – how to create graphs from contingency tables (two variables)
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial4_Twovariablegraphs.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
NOTE – APA requires the legend to be within the graph; if the type of graph does not allow for that, the white space at the top right hand side is preferred
VIDEO – how to create a line/time series graph was not watched in class; the steps are similar, watch at your own convenience
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial5_timeseriesgraph.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/

Describing, summarizing and displaying data: Graphs using continuous data
-a histogram takes a continuous series and makes it look like an ordinal (categorical) table; it differs from a frequency table because the underlying data set is continuous
-categories (aka bins in MS Excel) can be defined by formulas or a priori; modify the rough bin width to be more sensible and logical
-information is lost by transforming the data to a lower level of measurement, which diminishes the quality of the data, but it gives a quick, rough view of the distribution of the data
VIDEO – how to create a histogram in MS Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial6_Histogram.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/
-a scatter graph is a visual representation of the relationship between two continuous variables, created from a contingency table; in the class example the curve showed a positive and moderate relationship between the variables
VIDEO – how to create a scatter graph in MS Excel, not watched in class, watch on your own!
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial7_Scatter.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/

How to Present Graphs
-balance substance and design
-proportion and balance, simplicity of design and complexity of data
-clear and efficient
-make sure it shows what you want
-efficiency means least ink and smallest space
-tell the truth, have a story and make sure your graph tells it
-avoid chart junk, i.e. a bunch of labels, crazy vibrations and extra graphics; keep graphs simple, plain and white

Displaying Data: “Mistakes” (or “Deceptions”)
-non-zero origin can exaggerate the change ex) Gordon Campbell starting the tax graph from 1000 made the difference between BC and Ontario look like 2/3 but it is actually only ½
-limiting scope ex) not examining other historic recessions when looking at this recession
-omitting data that refutes your point ex) not reporting totals/capita
*Always be cautious when looking at graphs, ask yourself ‘what are they not telling me?’
Q-In the 598 a student is critiquing there is very little variation in the data, so it is hard to see the differences in the bar graph. How could that be improved?
A-Perhaps a different style of graph, or give a non-manipulated graph and then provide another zoomed-in graph with an explanatory note. Always be explicit about omissions and limited scopes

Describing, summarizing and displaying data: Numerical
Central Tendencies
-different central tendency measures are applicable depending on the continuous data
-Mode is the value that occurs most often in the dataset; there can be multiple modes or no mode.
-Median is the point in the data where 50% of the data falls below it and 50% of the data is above it.
-Mean is the simple arithmetic mean. It is the balancing point of the data (where the fulcrum on the scale is placed). The simplified explanation of the formula is: the sum of all observations divided by the number of observations.
o In Excel the AVERAGE function calculates the mean
These central tendencies give the researcher a sense of the shape of the distribution of the data without having to create a histogram:
-if mean = median the distribution is symmetric (bimodal or unimodal)
-if there is one mode, it is unimodal
-a uniform distribution is not very common (mean = median, but no mode)
-if the mean, median and mode differ, it is a skewed distribution:
o positive skew (tail to the right, more heavily distributed at the start), typically mode<median<mean ex) after tax income in Canada
o negative skew (tail to the left, more heavily distributed at the high end of the range), typically mean<median<mode ex) University grades because of grade inflation
-median is more representative than mean if the distribution is skewed, because the median is not affected by outliers in the distribution
VIDEO – how to calculate central tendencies in Excel
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial8_centraltendency.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/

Dispersion
(Joke from Prof. Tedds) – It was funny, you had to be there! It illustrated that the average doesn’t tell us anything without dispersion
-Range – the simplest measure of dispersion (largest value – smallest value); it ignores all other data points and can be sensitive to outliers
-Variance – a single summary measure of dispersion that accounts for all data points; however, the result is expressed in squared units (units²)
-Standard deviation – the square root of the variance; taking the square root puts the measure back into the units the data were reported in. A large standard deviation means values are spread far from the mean, i.e. a more spread out distribution
o more to come on standard deviations in 502B!
VIDEO – how to calculate measures of dispersion in MS Excel.
There is a short cut under the ‘Data Analysis’ tool – descriptive statistics
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial9_Dispersion.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/

Measures of Association
-rather than a scatter diagram, we can use a covariance measure to summarize the linear relationship between two variables. Its sign shows the direction of the relationship, but its units are meaningless, so its size is hard to interpret. No causal effect is implied
-a much better measure is the sample correlation coefficient, which shows the strength of the association between variables
-the result is bounded between -1 and 1 and it is a unitless measure: closer to -1 means a stronger negative linear relationship between the variables, closer to 1 means a stronger positive linear relationship between the variables. A perfect association means all points fall on the line; if there is no relationship, no sloped line can be drawn through the points
VIDEO – how to calculate the correlation coefficient in Excel
Ex) a correlation coefficient of 0.6 represents a moderately strong association between income increases and food expenditure increases.
http://www.uvic.ca/shared/shared_rootsite/media/video/videoplayer.php?width=640&height=480&file=PADM/ExcelTutorial10_Correlation.mp4&firstImg=http%3A//publicadmin.uvic.ca/images/padmlogo.jpg%20&icons=true&repeat=list&shuffle=false&rtmpdvr=false&encryption=off&copyright=on&sourcetype=vod&streamer=rtmp%3A//vod.uvic.ca/vod/

Concluding Remarks
-The SR goes to the client, cc’ing Prof. Tedds and KIS, and is also delivered to Turnitin
-No class next week, so work on your 598 critique! Due in class and to Turnitin November 30, last class!
-Final SR requirements: must incorporate and address all comments from Profs, KIS/PICs/client.
-Must include a document that explains all the comments and how they have been addressed, or why they were not. Groups will get a zero if no summary document is provided. Be clear and detailed; clients want this information as well.
-Professor and KIS will go through our client’s comments together and will provide us with an amalgamation of consistent comments from everyone.
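
Optional cross-check outside Excel
For anyone who wants to double-check the Excel output from the workshop, the central tendency, dispersion and correlation measures described in these notes can be sketched in Python. This was not part of the lecture: the function names (describe, pearson_r) and the income/food numbers are made up for illustration, and the skew label is only the rough mean-versus-median rule of thumb from the notes, not a formal skewness statistic. The variance and standard deviation here are the sample (n-1) versions.

```python
import math
import statistics


def describe(values):
    """Central tendency and dispersion measures for a list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        skew = "positive"   # mean pulled up by a tail to the right
    elif mean < median:
        skew = "negative"   # mean pulled down by a tail to the left
    else:
        skew = "symmetric"
    return {
        "modes": statistics.multimode(values),    # can hold more than one value
        "median": median,
        "mean": mean,
        "range": max(values) - min(values),       # largest value - smallest value
        "variance": statistics.variance(values),  # sample variance, in squared units
        "std_dev": statistics.stdev(values),      # square root of the variance
        "skew": skew,
    }


def pearson_r(x, y):
    """Sample correlation coefficient: unitless, bounded between -1 and 1."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: household income and food spending, both in $000s.
income = [20, 30, 30, 40, 130]
food = [4, 5, 5, 6, 10]

summary = describe(income)
r = pearson_r(income, food)
print(summary)
print(round(r, 3))
```

In this made-up income series the single high value pulls the mean (50) well above the median (30), which is the point made in the notes about the median being more representative when the distribution is skewed.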