Data Management: E-STAT Lesson

Overview
Intro:
- Sources of data
- Choosing a question
Learning Goal of this lesson:
-to help you choose a topic for which data exists and is accessible to you
-to help you find the data to support your project
-to teach you how to access quality information from E-STAT
(our paid Statistics Canada database)
(StatsCan is the public face of our government statistics bureau)
1. Statistics in the News
http://google.com
Search for statscan news
http://www.thestar.com/business/tech_news/2013/11/01/statscan_data_points_to_canadas_growing_digital_divide_geist.html
http://www.sunnewsnetwork.ca/sunnews/canada/archives/2013/10/20131008-102228.html
http://ca.news.yahoo.com/adding-up-the-ways-we%E2%80%99re-falling-behind-in-education194540699.html
http://www.nationalpost.com/m/wp/sports/mlb/blog.html?b=sports.nationalpost.com/2013/09/24/more-than-two-thirds-of-quebecers-want-major-league-baseball-back-in-montrealaccording-to-poll
Statscan twitter
2. What are statistics used for?
Medical studies; Epidemiology (tracking disease outbreaks)
Merchandising (tracking consumer purchases)
Government policy
Personal decisions – stocks, investments, productivity
Sports
http://www.tsn.ca/columnists/scott_cullen/?id=267960
A move towards “evidence-based decision making”
3. Good graphs; Bad graphs
Animating data to produce “information”
http://www.youtube.com/watch?v=jbkSRLYSojo
How to Make Data Look Sexy (CNN)
Are women bad at math?
http://www.slate.com/blogs/xx_factor/2013/08/29/are_women_bad_at_math_graphs_refute.html
Bad graphs
http://misterguch.brinkster.net/graph.html
http://gator.gatewayk12.org/~smcgrail/myweb/powerpoint/misleading_graphs/here_are_some_examples_of_mislea.htm
http://en.wikipedia.org/wiki/Misleading_graph
Bad graphs
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/images/phillips2.jpg
Three elements of bad graph design:
Data Ambiguity, Data Distortion, and Data Distraction.
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm
Types of graphs
http://cas.illinoisstate.edu/jpda/charting_data/fillesbysection.shtml
Scatterplots (correlation vs causation)
4. How do I find a topic?
Where to get ideas for your project:
1. Do a literature search using the databases to see
- What relationships are out there
- What data has been collected already
2. Browse StatsCan to see what categories may be of interest to you. Check out the
guides and lesson plans too.
3. Other online searches (include the term statistics or MDM4U)
http://mathforum.org/workshops/usi/dataproject/usi.hslessons.html
MDM4U resources on the Brock University website
http://www.brocku.ca/cmt/mdm4u/intro.htm
http://www.brocku.ca/cmt/mdm4u/resr/index.html
4. Brainstorm
5. Exemplars
http://schools.hwdsb.on.ca/highland/files/2011/02/MDM4U-Final-Project-PGA.pdf
http://teacherweb.com/ON/statistics/Math/photo3.stm
Proposal Phase:
1. Do some presearching; find background information on the topic; determine what is
already known in the field
2. Thesis – your main thesis question/statement and the sub-problems you are going to
answer.
3. Determine the Population you will seek out or the Sample you will use
4. Analyze – Explain each of the following:
1. What are the main variables in your question?
2. Can these variables be measured statistically?
3. Is there enough data to make an interesting analysis?
5. Hypothesis – Predict what you expect to find or observe.
6. Why is it important for you to investigate this topic? Who is it most relevant to?
7. Data – Include either 1) all of the raw data that you are going to use from the Internet,
books, etc., with each source cited, OR 2) the survey that you are going to use. The survey
should not be distributed yet.
NOTE: For large datasets, a 1-page sample including a WWW link (with Name/Title) to the rest
is sufficient. I need to know how the researchers got their data.
HINT: Start your Bibliography as soon as you find your first useful web site. Trying to go
back and find information later is a nightmare.
5. Research and Project Design
1. Title Page
2. Table of Contents
- Include section headings and page numbers
- NOT numbered
3. Summary (like an abstract)
- Do not write this until you are finished your project!
- Page numbering starts here (1) – insert a section break
- In one page, briefly summarize your entire report.
- A summary section is something that would be read by a manager who didn’t have enough
time to read the entire report, so make sure that you have enough details that it can stand by
itself.
- At the very least, include the following information:
- Problem: A clear statement of what you are trying to learn
- Plan: The procedure you will use to carry out the study (How do you choose people?
How do you measure? Who does the measuring? What methods are you going to use?)
- Data: The data are collected according to the plan (What data did you collect? Where
did it come from?)
- Analysis: The data are summarized and analyzed to answer the thesis question
(numerical, graphical, informative sentences)
- Conclusions are drawn about what has been learned (note any biases, suggest further
studies)
4. Problem
- Main thesis question. The thesis question is the theme of your report (e.g. What is the relationship
between an NBA player’s salary and their success?). Try to use the word “relationship” in your thesis
question. Remember, you do not have the tools to try and find any cause and effect.
- Sub-questions: The sub-questions are the smaller questions that you will answer that will lead you to
conclude on your main thesis question. These should be specific enough that they contain your
variables that you will compare. The problems may evolve slightly throughout the life of your project.
(e.g. What is the relationship between salary and a player’s points per game? What is the relationship
between salary and a player’s rebounds per game? What is the relationship between salary and the
number of games that a player has won?)
- Hypothesis – What do you expect to find?
- Define the population and describe the characteristics of the population (e.g. all players in the NBA
that played at least 70 games during the 2011-12 regular season).
- Define the independent variables (e.g. points per game in 2011-12 NBA regular season, rebounds
per game in 2011-12 NBA regular season)
- Define the dependent variables (e.g. player salary).
5. Plan
- Select the sampling method and justify your choice
- Design and explain the Experiment/Survey/Questionnaire/Data Collection process.
- Identify any possible biases
NOTE: if the data is not your own, you need to find out as much of the above information as possible and
point out the parts that you don’t know.
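If you choose simple random sampling, the selection step can be sketched in a few lines of Python. This is only an illustration: the list of 600 student numbers, the sample size of 40 and the function name simple_random_sample are made up, and your own sampling frame will look different.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members of population, each equally likely to be picked."""
    rng = random.Random(seed)  # a fixed seed lets you reproduce the same sample later
    return rng.sample(list(population), n)

# Hypothetical sampling frame: student numbers 1 to 600 (an assumption for this sketch).
student_ids = list(range(1, 601))
sample = simple_random_sample(student_ids, 40, seed=2013)
print(len(sample), "students selected")
```

Recording the seed in your Plan section lets a reader check exactly how the sample was drawn.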
6. Data
- Put all of your raw data collected in an appendix, not in this section
- Include summaries of your key variables here (frequency tables – but not histograms or graphs)
- Identify all problems you ran into with your data (Did you need to ‘massage’ it to use it in
Excel/Fathom? Did you alter the scale?)
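A frequency table for one variable can also be built outside Excel/Fathom. The sketch below uses Python's standard library; the sleep-hours values are invented purely for illustration and frequency_table is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, relative frequency) rows, sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in sorted(counts.items())]

# Made-up data: hours of sleep reported by 10 students.
hours = [6, 7, 7, 8, 6, 7, 9, 8, 7, 6]

print("Hours  Count  Rel. freq.")
for value, count, rel in frequency_table(hours):
    print(f"{value:>5}  {count:>5}  {rel:>9.0%}")
```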
7. Analysis
For each sub-question identified, use the concepts we learned in class to describe the data or find
trends/relationships. Only include those that are relevant.
(a) Numerical Statistics (your report must include at least 3)
- Find means, modes, and medians
- Find the standard deviation, Q1, Q3, IQR, percentiles
- Use linear regression and find the correlation coefficient, equation of a line of best fit
- Use non-linear regression and find the coefficient of determination, equation of a curve of best fit
- Relate your data to the Normal Distribution, Binomial Distribution or another distribution.
- Use z-scores and z-tables to find some useful information.
- Permutations, Combinations and Probability:
  - Predict the probability of certain events using your model
  - Do something else relating to probability
  - Use a simulation to help you discover a probability
  - Use the binomial theorem
  - Create a probability distribution
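Several of the numerical statistics listed above can be checked with a short script. The following is a minimal sketch, not a required method: the five data pairs are made up (they are not real NBA figures), and the helper names pearson_r, best_fit_line and z_score are hypothetical. Note that statistics.quantiles uses its own default quartile rule, so Q1 and Q3 may differ slightly from the method taught in class or used by Excel/Fathom.

```python
import math
import statistics

# Made-up paired data for illustration only.
points_per_game = [8.0, 12.0, 15.0, 20.0, 25.0]   # independent variable
salary_millions = [2.0, 4.0, 5.0, 9.0, 12.0]      # dependent variable

mean_x = statistics.mean(points_per_game)
median_x = statistics.median(points_per_game)
sd_x = statistics.stdev(points_per_game)           # sample standard deviation
q1, q2, q3 = statistics.quantiles(points_per_game, n=4)
iqr = q3 - q1

def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def best_fit_line(xs, ys):
    """Least-squares slope and intercept of the line of best fit y = slope*x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def z_score(value, xs):
    """Number of sample standard deviations that value lies from the mean of xs."""
    return (value - statistics.mean(xs)) / statistics.stdev(xs)

r = pearson_r(points_per_game, salary_millions)
slope, intercept = best_fit_line(points_per_game, salary_millions)
print(f"mean={mean_x}, median={median_x}, sd={sd_x:.2f}, IQR={iqr}")
print(f"r={r:.3f}, line of best fit: y = {slope:.3f}x + {intercept:.3f}")
print(f"z-score of 25 points per game: {z_score(25.0, points_per_game):.2f}")
```

A value of r close to 1 or -1 describes a strong linear relationship; it does not show that one variable causes the other.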
(b) Graphical Representations (you must include at least 3)
- Scatter plots (this should be included in every project as you will be finding many relationships)
- Bar graph / histogram / frequency polygon (histogram + curve) / cumulative frequency polygon
(each freq. is a cumulative total) / relative frequency polygon (freq. as a %) / line graph / moving
average
- Box and whisker
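If you plan to graph a moving average, the smoothed values have to be calculated first. Here is a minimal sketch of a simple (unweighted) moving average; the monthly counts are invented for illustration and moving_average is a hypothetical helper name.

```python
def moving_average(values, window):
    """Return the simple moving average over each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Made-up monthly counts for illustration.
monthly = [10, 12, 11, 15, 14, 18]
smoothed = moving_average(monthly, 3)
print([round(v, 2) for v in smoothed])
```

A 3-point window on 6 values gives 4 averaged points, so the smoothed line is shorter than the original series.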
(c) Information – descriptive sentences. This part is very important and often overlooked by students.
Don’t just provide numbers and statistics. Be sure to interpret them for the reader. What do the numbers
tell you? Include this with each concept / graph.
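For the probability options in part (a), a simulation can be compared against the binomial formula. The sketch below is one possible approach under stated assumptions: the scenario (exactly 7 successes in 10 trials with success probability 0.5), the number of simulated repetitions and the function names are all made up for illustration.

```python
import math
import random

def binomial_probability(n, k, p):
    """Exact P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def simulate_probability(n, k, p, repetitions=20_000, seed=1):
    """Estimate P(X = k) by repeating the n-trial experiment many times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        successes = sum(rng.random() < p for _ in range(n))
        if successes == k:
            hits += 1
    return hits / repetitions

exact = binomial_probability(10, 7, 0.5)
estimate = simulate_probability(10, 7, 0.5)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

The simulated value should land near the exact one; in your report, interpret the difference between them rather than just listing both numbers.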
8. Conclusion
- Draw conclusions that directly relate to your thesis.
- Note any biases that you believe occurred in your study.
- Make suggestions for further/follow-up studies or any modifications that you would make to the
current study.
9. Bibliography
Web sites cited using APA format.
Research Cycle: Steps to Success
Maximizing Evidence and Data Sheets (from HWDSB eBEST)
1) Develop Your Question
- Identify the area of interest (issue, concern, untested hypotheses, unanswered question, etc.).
- Define in clear, specific terms, the actual, specific problems that will be the focus of your investigation.
- Identify the variables you are interested in. What (I.V.) causes what (D.V.)?
- Identify the independent and dependent variables.
2) Gather Existing Evidence/Data on Your Topic
- You may conduct a literature search related to the problem area.
- You may examine any available data/information.
3) Make a Prediction
- Based on the literature that you have read, what is your hypothesis?
- The hypothesis should engage the two variables…change in the I.V. affects the D.V. in some way.
E.g. As poverty rises, rates of attending university will fall.
- Be specific and attempt an explanation of why you think the hypothesis is true.
4) Make a Study/Evaluation Plan
Will you seek out existing data and repurpose it or create your own through a survey?
If you are using existing data…you need to
- Evaluate the quality of the source of your data
- Determine how it was collected
- Was it linked to by others?
See if other sources support this source.
This is done by adding “link:” to the beginning of the URL (e.g., type the following into a Google search
window… link:http://www.hwdsb.on.ca)
- How was the data collected? Is the data for the two variables related?
  - If you are studying the hypothesis above you cannot get poverty statistics from Alberta and
university attendance rates from Ontario.
  - If you were studying teen sleep and Grade 12 averages, you must use data that was collected from
the same students for whom you will gather grade data.
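The requirement that both variables come from the same individuals can be checked mechanically before any analysis. The following is a minimal sketch, assuming each record carries a shared identifier; the anonymous student codes, the values and the helper name pair_by_key are all hypothetical.

```python
def pair_by_key(first, second):
    """Keep only the keys found in both datasets and return the matched (x, y) pairs."""
    shared = sorted(first.keys() & second.keys())
    return [(first[key], second[key]) for key in shared]

# Hypothetical records keyed by an anonymous student code.
sleep_hours = {"S01": 6.5, "S02": 8.0, "S03": 7.0, "S04": 5.5}
grade_average = {"S02": 84, "S03": 78, "S04": 70, "S05": 91}

pairs = pair_by_key(sleep_hours, grade_average)
print(len(pairs), "students have both measurements:", pairs)
```

Students who appear in only one dataset (S01 and S05 here) are dropped, because there is nothing to pair them with.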
If you do a survey…you need to determine
- Who are the participants?
- What data/information will you collect?
- Are there any ethical issues to consider?
- Where will the data collection take place?
- When do you plan to collect the data? Once? More than once?
- How will you collect the data? What tools/instruments will you use?
- Do you plan to include an intervention?
- What type of analysis will you use to make sense of the findings?
5) Collect Data/Gather the Information
- Put your study/evaluation plan in place!
- Watch for confounds
6) Examine/Explore/Analyze Data
- Identify and describe the key findings (look for trends, graph the data, examine other relationships).
- Graphs should be made of summary data only.
- Graphs should always show the I.V. on the horizontal or x-axis and the D.V. on the vertical or y-axis.
- Graphs should be appropriate to the type of data (bar graphs for comparison of non-continuous
information and line graphs for continuous data)
o Continuous data is data for which values exist at all levels (e.g. time)
o Non-continuous data is data such as eye colour (either blue or brown or hazel or other categories
you create)
- Some important considerations:
o What were the most important findings? And why are they so important?
o To whom are the findings most relevant?
o What are the limitations of the research?
Potential Problems:
Almost EVERY problem that I have seen on final projects was because of an incomplete or poorly done
proposal phase. The following factors have created flawed projects in past Data Management courses:
- Projects were far too large in scope. A research team of 100 working for 25 years would be unable
to prove causation in the way that these students wished to do. This happens most often with
projects like drunk driving, teenage pregnancy or economic problems. Choose less glamorous and
smaller topics that you can find data about. Make sure data is available.
- Projects which attempted to prove causation instead of correlation.
- Projects whose entire body of evidence was based on unreliable sources from the Internet.
They made no attempt to figure out where their sources' data came from.
- Projects where random sampling involved giving a survey to everyone in their class. The sample size
was too small or too homogeneous.
- Projects where the students developed their surveys first and their research questions second.
They ended up not asking the correct survey questions and were unable to prove their point.
6. Variables
Decide on variables
Developing a Good Research Question
Identify the variables you are going to study. Your preliminary research should allow you to develop a
hypothesis to relate two variables. One is called the independent variable and the other is the dependent
variable. Your hypothesis should be phrased such that you are studying the effect of the independent
variable on the dependent variable. You should attempt to find data sets that control any other
variables.
E.g. Does the number of hours children watch television per day affect levels of obesity?
Independent variable:
Dependent variable:
Controls:
Independent variable – The variable you think affects something else. You select
different levels for this variable and then see how this affects another variable.
You have to have an idea that the variables are related…that one causes the other to
change, that they are correlated.
Dependent variable – This is the variable you would measure.
Controls – The standard to which you compare your results; some variables need to be
controlled (e.g. age/sex).
Relate them using a hypothesis:
Does A (I.V.) affect B (D.V.)? Be specific.
Define A and B fully.
7. Where can you find data?
First you will need to decide what type of data you will collect for your project:
1. Primary Data is information that you collect on your own. For example, this could be obtained by
having students at AHS complete a questionnaire on paper or online (using Fathom,
surveymonkey.com, etc.). It could also be an actual experiment/simulation that you conduct using
a computer.
2. Secondary Data is information that you are taking from another source. It is important to use
reliable sources. When choosing your topic, be sure that you will be able to find good data. Some
places that students often obtain data from are as follows:
Let’s go on a bit of a tour…
http://www.statcan.gc.ca/start-debut-eng.html
Home login > go to the virtual library… Virtual tools… E-STAT
University of Michigan Library of Statistics http://www.lib.umich.edu/govdocs/stats.html
Nation Master www.nationmaster.com
http://cas.illinoisstate.edu/jpda/finding_data/internationaldata.shtml
http://ontario.compareschoolrankings.org/secondary/SchoolsByRankLocationName.aspx
Use search terms like….data sets; statistics;
Wolfram Alpha
http://www.wolframalpha.com/
Google Trends (under categories; shows you the # of Google searches)
http://www.google.com/trends/
RDC
http://www.statcan.gc.ca/rdc-cdr/data-donnee-eng.htm
Open data project
http://www.data.gc.ca/default.asp?lang=En&n=F9B7A1E3-1
UN data
http://data.un.org/
http://data.un.org/Search.aspx?q=population
Datamob
http://datamob.org/datasets
Areas of data
CDC
http://www.cdc.gov/nchs/fastats/Default.htm
http://www.cdc.gov/nchs/products/hestats.htm
Health
http://www.nlm.nih.gov/hsrinfo/datasites.html
http://phpartners.org/health_stats.html
http://web4.uwindsor.ca/units/leddy/leddy.nsf/HealthStatistics!OpenForm (list of
Canadian sites for health info)
World Bank - Economic Data
http://data.worldbank.org/
http://data.worldbank.org/data-catalog
Open Data Hamilton, ON
http://openhamilton.ca/article/starting-hamiltons-open-data-sets
Freebase -gives info and data
http://www.freebase.com/
Data.gov
http://www.data.gov/
List of open datasets
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
Library and Archives Canada
http://www.collectionscanada.gc.ca/opendata/900-1000-e.html
Open Government
http://open.gc.ca/index-eng.asp
Canadian data
http://data.gc.ca/eng
Health:
-- Teen Drug Use and Abuse (Jeff Madeiros and Corey Hoekstra, St. Peter H.S., 2008)
-- Obesity and Diabetes: A Growing Epidemic (Jennifer Brierley, St. Pius X H.S., 2006)
-- Teen Pregnancy and Abortion (Nicole Sanger and Brittany Burg, St. Peter H.S., 2006)
-- Canada's Health by Region (presentation, Colin McClenaghan and Jack Wei, K.C.V.I., 2006)
-- Dietary Habits of Canadians (Bronwyn, Hillcrest H.S., 2006)
-- Factors affecting Thyroid Condition (presentation, Shafaq, Nepean H.S., 2004)
-- Does the Marital Status of Parents affect their Kids? (presentation, Cassandra, Brandon, Victoria, Derek,
Mother Teresa HS, Ottawa, Dec. 2008)
Education:
-- Video game usage and Absenteeism in Canadian high schools (Bryan Smith, Orangeville D.S.S., 2008)
-- Education, Salary, and Career Paths (Patrick Jackson, North Dundas S.S., 2006)
-- Factors affecting Student Achievement (Lisa Hoople, Sacred Heart C.H.S., 2006)
-- Substance Use and Academic Performance (Charlie Berrigan, Smiths Falls C.I., 2006)
Politics and Economics:
-- Variables affecting Voter Turnout in Canadian Federal Elections (Shaun Banke, Holy Trinity H.S., 2006)
-- Political Opinions of Students (presentation, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Political Opinions of Students (dataset, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Analysis of the Canadian Federal Debt using E-STAT (Linda, Pickering H.S., 2004)
-- Factors affecting Income (Jodi Morden and Mike Curridor, Sacred Heart, 2003)
Transportation:
-- Factors that Influence Collisions (Dimitar Hristov, Vesko Avramov, and Ayman Barri, Brookfield H.S.,
2006)
-- Why do Young People pay more for Auto Insurance? (T.J., Opeongo H. S., 2005)
-- Teenager Driving Infractions (Erin Knox and Katherine Renner, Sacred Heart C.H.S., 2006)
Environment and Energy:
-- Global Warming (Renpeng Sun, Nepean H.S., 2008)
-- Greenhouse Gas Emissions (Sarah Deslippe, K.C.V.I., Kingston, 2006)
-- Carbon Dioxide Emissions (Mathew Hall, Dr. Williams S.S., 2006)
-- The Future of Electricity (Jonathan Thomas, Opeongo H.S., 2006)
-- Trends in Energy Consumption in Canada (Chenyu Bing, Sir Robert Borden H.S., 2006)
Other:
-- Unemployment & Divorce in Canada (Rachel Wang, Glebe C.I., 2008)
-- The Effect of Tourism on GDP in Canada (Feifei C., L'Amoreaux C.I., 2007)
-- Travellers in Canada (Gosia, Sacred Heart C.H.S., 2005)
-- Investigation on the effects of Data Manipulation (Matt, Will, and James, Carleton Place H.S., 2005)
-- Factors affecting Internet Use (Bryan W., Earl of March S.S., 2004)