Data Management: E-STAT Lesson Overview

Intro: Sources of data; Choosing a question

Learning Goals of this lesson:
- to help you choose a topic for which data exists and is accessible to you
- to help you find the data to support your project
- to teach you how to access quality information from E-STAT (our paid Statistics Canada database; StatsCan is the public face of our government statistics bureau)

1. Statistics in the News
http://google.com (search for "statscan news")
http://www.thestar.com/business/tech_news/2013/11/01/statscan_data_points_to_canadas_growing_digital_divide_geist.html
http://www.sunnewsnetwork.ca/sunnews/canada/archives/2013/10/20131008-102228.html
http://ca.news.yahoo.com/adding-up-the-ways-we%E2%80%99re-falling-behind-in-education194540699.html
http://www.nationalpost.com/m/wp/sports/mlb/blog.html?b=sports.nationalpost.com/2013/09/24/more-than-two-thirds-of-quebecers-want-major-league-baseball-back-in-montrealaccording-to-poll
StatsCan Twitter

2. What are statistics used for?
- Medical studies; epidemiology (tracking disease outbreaks)
- Merchandising (tracking consumer purchases)
- Government policy
- Personal decisions – stocks, investments, productivity
- Sports http://www.tsn.ca/columnists/scott_cullen/?id=267960
A move towards “evidence-based decision making”

3. Good graphs; Bad graphs
Animating data to produce “information”
http://www.youtube.com/watch?v=jbkSRLYSojo
How to Make Data Look Sexy (CNN)
Are women bad at math?
http://www.slate.com/blogs/xx_factor/2013/08/29/are_women_bad_at_math_graphs_refute.html
Bad graphs
http://misterguch.brinkster.net/graph.html
http://gator.gatewayk12.org/~smcgrail/myweb/powerpoint/misleading_graphs/here_are_some_examples_of_mislea.htm
http://en.wikipedia.org/wiki/Misleading_graph
Bad graphs
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/images/phillips2.jpg
Three elements of bad graph design: Data Ambiguity, Data Distortion, and Data Distraction.
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm
Types of graphs
http://cas.illinoisstate.edu/jpda/charting_data/fillesbysection.shtml
Scatterplots (correlation vs causation)

4. How do I find a topic?
Where to get ideas for your project:
1. Do a literature search using the databases to see
   - what relationships are out there
   - what data has been collected already
2. Browse StatsCan to see what categories may be of interest to you. Check out the guides and lesson plans too.
3. Other online searches (include the term "statistics" or "MDM 4U")
   http://mathforum.org/workshops/usi/dataproject/usi.hslessons.html
   MDM4U resources on the Brock University website
   http://www.brocku.ca/cmt/mdm4u/intro.htm
   http://www.brocku.ca/cmt/mdm4u/resr/index.html
4. Brainstorm
5. Exemplars
   http://schools.hwdsb.on.ca/highland/files/2011/02/MDM4U-Final-Project-PGA.pdf
   http://teacherweb.com/ON/statistics/Math/photo3.stm

Proposal Phase:
1. Do some presearching; find background information on the topic; determine what is already known in the field.
2. Thesis – your main thesis question/statement and the sub-problems you are going to answer.
3. Determine the Population you will seek out or the Sample you will use.
4. Analyze – Explain each of the following:
   1. What are the main variables in your question?
   2. Can these variables be measured statistically?
   3. Is there enough data to make an interesting analysis?
5. Hypothesis – Predict what you expect to find or observe.
6. Why is it important for you to investigate this topic? Who is it most relevant to?
7. Data – Include either
   1) all of the raw data that you are going to use from the Internet, books, etc., sourced. NOTE: For large datasets, a 1-page sample including a WWW link (with Name/Title) to the rest is sufficient. I need to know how the researchers got their data.
   OR
   2) the survey that you are going to use. It should not be distributed yet.
HINT: Start your Bibliography as soon as you find your first useful web site.
Trying to go back and find information later is a nightmare.

5. Research and Project Design
1. Title Page
2. Table of Contents
   - Include section headings and page numbers
   - NOT numbered
3. Summary (like an abstract)
   - Do not write this until you are finished your project!
   - Page numbering starts here (1) – insert a section break
   - In one page, briefly summarize your entire report.
   - A summary section is something that would be read by a manager who didn’t have enough time to read the entire report, so make sure that you have enough details that it can stand by itself.
   - At the very least, include the following information:
     - Problem: A clear statement of what you are trying to learn
     - Plan: The procedure you will use to carry out the study (How do you choose people? How do you measure? Who does the measuring? What methods are you going to use?)
     - Data: The data are collected according to the plan (What data did you collect? Where did it come from?)
     - Analysis: The data are summarized and analyzed to answer the thesis question (numerical, graphical, informative sentences)
     - Conclusions: Conclusions are drawn about what has been learned (note any biases, suggest further studies)
4. Problem
   Main thesis question: The thesis question is the theme of your report (e.g. What is the relationship between an NBA player’s salary and their success?). Try to use the word “relationship” in your thesis question. Remember, you do not have the tools to try and find any cause and effect.
   Sub-questions: The sub-questions are the smaller questions that you will answer that will lead you to conclude on your main thesis question. These should be specific enough that they contain the variables that you will compare. The problems may evolve slightly throughout the life of your project. (e.g. What is the relationship between salary and a player’s points per game? What is the relationship between salary and a player’s rebounds per game?
What is the relationship between salary and the number of games that a player has won?)
   Hypothesis – What do you expect to find?
   Define the population and describe the characteristics of the population (e.g. all players in the NBA that played at least 70 games during the 2011-12 regular season).
   Define the independent variables (e.g. points per game in 2011-12 NBA regular season, rebounds per game in 2011-12 NBA regular season).
   Define the dependent variables (e.g. player salary).
5. Plan
   - Select the sampling method and justify your choice
   - Design and explain the Experiment/Survey/Questionnaire/Data Collection process.
   - Identify any possible biases
   NOTE: if the data is not your own, you need to find out as much of the above information as possible and point out the parts that you don’t know.
6. Data
   - Put all of your raw data collected in an appendix, not in this section
   - Include summaries of your key variables here (frequency tables – but not histograms or graphs)
   - Identify all problems you ran into with your data (Did you need to ‘massage’ it to use it in Excel/Fathom? Did you alter the scale?)
7. Analysis
   For each sub-question identified, use the concepts we learned in class to describe the data or find trends/relationships. Only include those that are relevant.
   (a) Numerical Statistics (your report must include at least 3)
   - Find means, modes, and medians
   - Find the standard deviation, Q1, Q3, IQR, percentiles
   - Use linear regression and find the correlation coefficient, equation of a line of best fit
   - Use non-linear regression and find the coefficient of determination, equation of a curve of best fit
   - Relate your data to the Normal Distribution, Binomial Distribution or another distribution. Use z-scores and z-tables to find some useful information.
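If you prefer to check your Excel/Fathom results in code, the numerical statistics listed so far could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: the data values are invented for illustration (they are not real NBA figures), the helper names `five_number_parts`, `linear_fit`, and `z_score` are hypothetical, and the quartile method used by Python's `statistics` module may not match the method taught in class.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

# HYPOTHETICAL data, invented for illustration only (not real NBA figures).
points_per_game = [8.0, 12.0, 15.0, 18.0, 22.0, 25.0]  # independent variable
salary_millions = [2.0, 4.0, 5.0, 9.0, 12.0, 16.0]     # dependent variable


def five_number_parts(data):
    """Return (Q1, median, Q3, IQR) using the module's default quartile method.

    Assumption: other tools (Excel, Fathom, by hand) may use a different
    quartile convention and give slightly different Q1/Q3 values.
    """
    q1, q2, q3 = quantiles(data, n=4)
    return q1, q2, q3, q3 - q1


def linear_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope*x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient
    return slope, intercept, r


def z_score(value, data):
    """Number of sample standard deviations that value lies from the mean."""
    return (value - mean(data)) / stdev(data)


slope, intercept, r = linear_fit(points_per_game, salary_millions)
q1, med, q3, iqr = five_number_parts(salary_millions)
z = z_score(12.0, salary_millions)

print(f"mean = {mean(salary_millions):.2f}, median = {median(salary_millions):.2f}")
print(f"std dev = {stdev(salary_millions):.2f}, Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"line of best fit: salary = {slope:.3f} * ppg + {intercept:.3f}, r = {r:.3f}")
# NormalDist().cdf(z) plays the role of a z-table lookup; it is only meaningful
# if the data are roughly normally distributed.
print(f"z-score of a 12.0 salary = {z:.2f}, area to the left = {NormalDist().cdf(z):.3f}")
```

Whatever tool you use, remember that a correlation coefficient close to 1 or -1 describes a relationship; it does not show cause and effect.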
   Permutations, Combinations and Probability:
   - Predict the probability of certain events using your model
   - Do something else relating to probability
   - Use a simulation to help you discover a probability
   - Use the binomial theorem
   - Create a probability distribution
   (b) Graphical Representations (you must include at least 3)
   - Scatter plots (this should be included in every project as you will be finding many relationships)
   - Bar graph / histogram / frequency polygon (histogram + curve) / cumulative frequency polygon (each freq. is a cumulative total) / relative frequency polygon (freq. as a %) / line graph / moving average
   - Box and whisker
   (c) Information – descriptive sentences. This part is very important and often overlooked by students. Don’t just provide numbers and statistics. Be sure to interpret them for the reader. What do the numbers tell you? Include this with each concept / graph.
8. Conclusion
   - Draw conclusions that directly relate to your thesis.
   - Note any biases that you believe occurred in your study.
   - Make suggestions for further/follow-up studies or any modifications that you would make to the current study.
9. Bibliography
   Web sites cited using APA format.

Research Cycle: Steps to Success
Maximizing Evidence and Data Sheets (from HWDSB eBEST)

1) Develop Your Question
   - Identify the area of interest (issue, concern, untested hypotheses, unanswered question, etc.).
   - Define, in clear and specific terms, the actual problems that will be the focus of your investigation.
   - Identify the variables you are interested in. What (I.V.) causes what (D.V.)? Identify the independent and dependent variables.
2) Gather Existing Evidence/Data on Your Topic
   - You may conduct a literature search related to the problem area.
   - You may examine any available data/information.
3) Make a Prediction
   Based on the literature that you have read, what is your hypothesis? The hypothesis should engage the two variables…change in the I.V. affects the D.V. in some way.
   As poverty rises, rates of attending university will fall. Be specific and attempt an explanation of why you think the hypothesis is true.
4) Make a Study/Evaluation Plan
   Will you seek out existing data and repurpose it, or create your own through a survey?
   If you are using existing data…you need to
   - Evaluate the quality of the source of your data
   - Determine how it was collected
   - Was it linked to by others? See if other sources support this source. This is done by adding “link:” to the beginning of the URL (e.g., type the following into a Google search window… link:http://www.hwdsb.on.ca)
   - How was the data collected?
   - Is the data for the two variables related? If you are studying the hypothesis above, you cannot get poverty statistics from Alberta and university attendance rates from Ontario. If you were studying teen sleep and Grade 12 averages, you must use sleep data that was collected from the same students for whom you will gather grade data.
   If you do a survey…you need to determine
   - Who are the participants?
   - What data/information will you collect?
   - Are there any ethical issues to consider?
   - Where will the data collection take place?
   - When do you plan to collect the data? Once? More than once?
   - How will you collect the data? What tools/instruments will you use?
   - Do you plan to include an intervention?
   - What type of analysis will you use to make sense of the findings?
5) Collect Data/Gather the Information
   - Put your study/evaluation plan in place!
   - Watch for confounds
6) Examine/Explore/Analyze Data
   - Identify and describe the key findings (look for trends, graph the data, examine other relationships).
   - Graphs should be made of summary data only.
   - Graphs should always show the I.V. on the horizontal or x-axis and the D.V. on the vertical or y-axis.
   - Graphs should be appropriate to the type of data (bar graphs for comparison of non-continuous information and line graphs for continuous data)
     o Continuous data is data for which values exist at all levels (e.g.
time)
     o Non-continuous data is data such as eye colour (either blue or brown or hazel or other categories you create)
   - Some important considerations:
     o What were the most important findings? And why are they so important?
     o To whom are the findings most relevant?
     o What are the limitations of the research?

Potential Problems:
Almost EVERY problem that I have seen on final projects was because of an incomplete or poorly done proposal phase. The following factors have created flawed projects in past Data Management courses:
- Projects that were far too large in scope. A research team of 100 working for 25 years would be unable to prove causation in the way that these students wished to do. This happens most often with projects like drunk driving, teenage pregnancy or economic problems. Choose less glamorous and smaller topics that you can find data about. Make sure data is available.
- Projects that attempted to prove causation instead of correlation.
- Projects whose entire body of evidence was based on unreliable sources from the Internet. They made no attempt to figure out where their sources' data came from.
- Projects where "random sampling" meant giving a survey to everyone in their class. The sample size was too small or too homogeneous.
- Projects where the students developed their surveys first and their research questions second. They ended up not asking the correct survey questions and were unable to prove their point.

6. Variables
Decide on variables
Developing a Good Research Question
Identify the variables you are going to study. Your preliminary research should allow you to develop a hypothesis to relate two variables. One is called the independent variable and the other is the dependent variable. Your hypothesis should be phrased such that you are studying the effect of the independent variable on the dependent variable. You should attempt to find data sets that control any other variables. E.g.
Does the number of hours children watch television per day affect levels of obesity?
   Independent variable:
   Dependent variable:
   Controls:
Independent variable: The variable you think affects something else. You select different levels for this variable and then see how this affects another variable. You have to have an idea that the variables are related…that one causes the other to change, that they are correlated.
Dependent variable: This is the variable you would measure.
Controls: the standard to which you compare your results; some variables need to be controlled (e.g. age/sex).
Relate them using a hypothesis: Does A (I.V.) affect B (D.V.)? Be specific. Define A and B fully.

7. Where can you find data?
First you will need to decide what type of data you will collect for your project:
1. Primary Data is information that you collect on your own. For example, this could be obtained by having students at AHS complete a questionnaire on paper or online (using Fathom, surveymonkey.com, etc.). It could also be an actual experiment/simulation that you conduct using a computer.
2. Secondary Data is information that you are taking from another source. It is important to use reliable sources. When choosing your topic, be sure that you will be able to find good data.
Some places that students often obtain data from are as follows:
Let’s go on a bit of a tour…
http://www.statcan.gc.ca/start-debut-eng.html (Home, log in, go to the virtual library … Virtual tools … E-STAT)
University of Michigan Library of Statistics http://www.lib.umich.edu/govdocs/stats.html
Nation Master www.nationmaster.com
http://cas.illinoisstate.edu/jpda/finding_data/internationaldata.shtml
http://ontario.compareschoolrankings.org/secondary/SchoolsByRankLocationName.aspx
Use search terms like: data sets; statistics
Wolfram Alpha http://www.wolframalpha.com/
Google Trends (under categories; shows you the number of Google searches) http://www.google.com/trends/
RDC http://www.statcan.gc.ca/rdc-cdr/data-donnee-eng.htm
Open data project http://www.data.gc.ca/default.asp?lang=En&n=F9B7A1E3-1
UN data http://data.un.org/ http://data.un.org/Search.aspx?q=population
Datamob http://datamob.org/datasets

Areas of data
CDC http://www.cdc.gov/nchs/fastats/Default.htm http://www.cdc.gov/nchs/products/hestats.htm
Health http://www.nlm.nih.gov/hsrinfo/datasites.html http://phpartners.org/health_stats.html
http://web4.uwindsor.ca/units/leddy/leddy.nsf/HealthStatistics!OpenForm (list of Canadian sites for health info)
World Bank - Economic Data http://data.worldbank.org/ http://data.worldbank.org/data-catalog
Open Data Hamilton, ON http://openhamilton.ca/article/starting-hamiltons-open-data-sets
Freebase - gives info and data http://www.freebase.com/
Data.gov http://www.data.gov/
List of open datasets http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
Library and Archives Canada http://www.collectionscanada.gc.ca/opendata/900-1000-e.html
Open Government http://open.gc.ca/index-eng.asp
Canadian data http://data.gc.ca/eng

Health:
-- Teen Drug Use and Abuse (Jeff Madeiros and Corey Hoekstra, St. Peter H.S., 2008)
-- Obesity and Diabetes: A Growing Epidemic (Jennifer Brierley, St. Pius X H.S., 2006)
-- Teen Pregnancy and Abortion (Nicole Sanger and Brittany Burg, St.
Peter H.S., 2006)
-- Canada's Health by Region (presentation, Colin McClenaghan and Jack Wei, K.C.V.I., 2006)
-- Dietary Habits of Canadians (Bronwyn, Hillcrest H.S., 2006)
-- Factors affecting Thyroid Condition (presentation, Shafaq, Nepean H.S., 2004)
-- Does the Marital Status of Parents affect their Kids? (presentation, Cassandra, Brandon, Victoria, Derek, Mother Teresa HS, Ottawa, Dec. 2008)

Education:
-- Video game usage and Absenteeism in Canadian high schools (Bryan Smith, Orangeville D.S.S., 2008)
-- Education, Salary, and Career Paths (Patrick Jackson, North Dundas S.S., 2006)
-- Factors affecting Student Achievement (Lisa Hoople, Sacred Heart C.H.S., 2006)
-- Substance Use and Academic Performance (Charlie Berrigan, Smiths Falls C.I., 2006)

Politics and Economics:
-- Variables affecting Voter Turnout in Canadian Federal Elections (Shaun Banke, Holy Trinity H.S., 2006)
-- Political Opinions of Students (presentation, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Political Opinions of Students (dataset, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Analysis of the Canadian Federal Debt using E-STAT (Linda, Pickering H.S., 2004)
-- Factors affecting Income (Jodi Morden and Mike Curridor, Sacred Heart, 2003)

Transportation:
-- Factors that Influence Collisions (Dimitar Hristov, Vesko Avramov, and Ayman Barri, Brookfield H.S., 2006)
-- Why do Young People pay more for Auto Insurance? (T.J., Opeongo H.S., 2005)
-- Teenager Driving Infractions (Erin Knox and Katherine Renner, Sacred Heart C.H.S., 2006)

Environment and Energy:
-- Global Warming (Renpeng Sun, Nepean H.S., 2008)
-- Greenhouse Gas Emissions (Sarah Deslippe, K.C.V.I., Kingston, 2006)
-- Carbon Dioxide Emissions (Mathew Hall, Dr.
Williams S.S., 2006)
-- The Future of Electricity (Jonathan Thomas, Opeongo H.S., 2006)
-- Trends in Energy Consumption in Canada (Chenyu Bing, Sir Robert Borden H.S., 2006)

Other:
-- Unemployment & Divorce in Canada (Rachel Wang, Glebe C.I., 2008)
-- The Effect of Tourism on GDP in Canada (Feifei C., L'Amoreaux C.I., 2007)
-- Travellers in Canada (Gosia, Sacred Heart C.H.S., 2005)
-- Investigation on the effects of Data Manipulation (Matt, Will, and James, Carleton Place H.S., 2005)
-- Factors affecting Internet Use (Bryan W., Earl of March S.S., 2004)