Data Management

Data Management: Project Lesson
Overview
Intro:
• Sources of data
• Choosing a question
Learning goals of this lesson:
- to help you choose a topic for which data exists and is accessible to you
- to help you find the data to support your project
- to teach you how to access quality information from Statscan (the public face of our government statistics bureau)
1. Statistics in the News
http://google.com
Search for statscan news
http://www.thestar.com/business/tech_news/2013/11/01/statscan_data_points_to_canadas_growing_digital_divide_geist.html
Canadians & literacy, numeracy
http://www.sunnewsnetwork.ca/sunnews/canada/archives/2013/10/20131008-102228.html
http://ca.news.yahoo.com/adding-up-the-ways-we%E2%80%99re-falling-behind-in-education194540699.html
http://www.nationalpost.com/m/wp/sports/mlb/blog.html?b=sports.nationalpost.com/2013/09/24/more-than-two-thirds-of-quebecers-want-major-league-baseball-back-in-montrealaccording-to-poll
Statscan twitter
2. What are statistics used for?
Medical studies; Epidemiology (tracking disease outbreaks)
Merchandising (tracking consumer purchases)
Government policy
Personal decisions – stocks, investments, productivity
Sports (draft position)
http://www.tsn.ca/columnists/scott_cullen/?id=267960
A move towards “evidence-based decision making”
3. Good graphs; Bad graphs
Animating data to produce “information”
http://www.youtube.com/watch?v=jbkSRLYSojo
How to Make Data Look Sexy (CNN)
Are women bad at math?
http://www.slate.com/blogs/xx_factor/2013/08/29/are_women_bad_at_math_graphs_refute.html
Bad graphs
http://misterguch.brinkster.net/graph.html
http://gator.gatewayk12.org/~smcgrail/myweb/powerpoint/misleading_graphs/here_are_some_examples_of_mislea.htm
http://en.wikipedia.org/wiki/Misleading_graph
Bad graphs
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/images/phillips2.jpg
Three elements of bad graph design:
Data Ambiguity, Data Distortion, and Data Distraction.
http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm
Types of graphs
http://cas.illinoisstate.edu/jpda/charting_data/fillesbysection.shtml
Scatterplots (correlation vs causation)
4. How do I find a topic?
Where to get ideas for your project:
1. Do a literature search using the databases (Virtual Library) (secondary; library) to see
• What relationships are out there
• What data has been collected already
2. Browse StatsCan to see what categories may be of interest to you.
3. Other online searches (include the term statistics or MDM 4U)
http://mathforum.org/workshops/usi/dataproject/usi.hslessons.html
MDM4U resources on the Brock University website
http://www.brocku.ca/cmt/mdm4u/intro.htm
http://www.brocku.ca/cmt/mdm4u/resr/index.html
4. Brainstorm topics of interest
5. Exemplars
http://schools.hwdsb.on.ca/highland/files/2011/02/MDM4U-Final-Project-PGA.pdf
http://teacherweb.com/ON/statistics/Math/photo3.stm
Proposal Phase:
1. Do some presearching; find background information on the topic; determine what is
already known in the field
2. Thesis – your main thesis question/statement and the sub-problems you are going to
answer.
3. Determine the Population you will seek out or the Sample you will use
4. Analyze – Explain each of the following:
1. What are the main variables in your question?
2. Can these variables be measured statistically?
3. Is there enough data to make an interesting analysis?
5. Hypothesis – Predict what you expect to find or observe.
6. Why is it important for you to investigate this topic? Who is it most relevant to?
7. Data – Include either 1) all of the raw data that you are going to use from the Internet,
books, etc., sourced. NOTE: For large datasets a 1-page sample including a WWW link
(with Name/Title) to the rest is sufficient. I need to know how the researchers got their
data. OR 2) the survey that you are going to use. It should not be distributed yet.
HINT: Start your Bibliography as soon as you find your first useful web site. Trying to go
back and find information later is a nightmare.
5. Research and Project Design
1. Title Page
2. Table of Contents
- Include section headings and page numbers
- This page itself is NOT numbered
3. Summary (like an abstract)
- Do not write this until you are finished your project!
- Page numbering starts here (1) – insert a section break
- In one page, briefly summarize your entire report.
- A summary section is something that would be read by a manager who didn’t have enough
time to read the entire report, so make sure that you have enough details that it can stand by
itself.
- At the very least, include the following information:
- Problem: A clear statement of what you are trying to learn
- Plan: The procedure you will use to carry out the study (How do you choose people?
How do you measure? Who does the measuring? What methods are you going to use?)
- Data: The data are collected according to the plan (What data did you collect? Where
did it come from?)
- Analysis: The data are summarized and analyzed to answer the thesis question
(numerical, graphical, informative sentences)
- Conclusions: What has been learned (note any biases, suggest further studies)
4. Problem
• Main thesis question. The thesis question is the theme of your report (e.g. What is the relationship between an NBA player’s salary and their success?). Try to use the word “relationship” in your thesis question. Remember, you do not have the tools to try and find any cause and effect.
• Sub-questions: The sub-questions are the smaller questions that you will answer that will lead you to conclude on your main thesis question. These should be specific enough that they contain your variables that you will compare. The problems may evolve slightly throughout the life of your project. (e.g. What is the relationship between salary and a player’s points per game? What is the relationship between salary and a player’s rebounds per game? What is the relationship between salary and the number of games that a player has won?)
• Hypothesis – What do you expect to find?
• Define the population and describe the characteristics of the population (e.g. all players in the NBA that played at least 70 games during the 2011-12 regular season).
• Define the independent variables (e.g. points per game in 2011-12 NBA regular season, rebounds per game in 2011-12 NBA regular season)
• Define the dependent variables (e.g. player salary).
5. Plan
• Select the sampling method and justify your choice
• Design and explain the Experiment/Survey/Questionnaire/Data Collection process.
• Identify any possible biases
NOTE: if the data is not your own, you need to find out as much of the above information as possible and point out the parts that you don’t know.
6. Data
• Put all of your raw data collected in an appendix, not in this section
• Include summaries of your key variables here (frequency tables – but not histograms or graphs)
• Identify all problems you ran into with your data (Did you need to ‘massage’ it to use it in Excel/Fathom? Did you alter the scale?)
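The frequency tables mentioned above can also be produced outside Excel/Fathom. The following is a minimal Python sketch, not a required method: the `frequency_table` helper and the `responses` list are made up for illustration and are not part of the course materials.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, relative frequency) rows, most common first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]

# Hypothetical survey responses (hours of sleep, rounded), for illustration only.
responses = [7, 8, 6, 7, 7, 9, 8, 6, 7, 8]

print(f"{'value':>5} {'count':>5} {'rel.':>7}")
for value, count, rel in frequency_table(responses):
    print(f"{value:>5} {count:>5} {rel:>7.0%}")
```

Each row gives a value, how many times it occurred, and its share of all responses, which is the kind of summary this section asks for in place of the raw data.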
7. Analysis
For each sub-question identified, use the concepts we learned in class to describe the data or find
trends/relationships. Only include those that are relevant.
(a) Numerical Statistics (your report must include at least 3)
• Find means, modes, and medians
• Find the standard deviation, Q1, Q3, IQR, percentiles
• Use linear regression and find the correlation coefficient, equation of a line of best fit
• Use non-linear regression and find the coefficient of determination, equation of a curve of best fit
• Relate your data to the Normal Distribution, Binomial Distribution or another distribution.
• Use z-scores and z-tables to find some useful information.
• Permutations, Combinations and Probability:
- Predict the probability of certain events using your model
- Do something else relating to probability
- Use a simulation to help you discover a probability
- Use the binomial theorem
- Create a probability distribution
(b) Graphical Representations (you must include at least 3)
• Scatter plots (this should be included in every project as you will be finding many relationships)
• Bar graph / histogram / frequency polygon (histogram + curve) / cumulative frequency polygon (each freq. is a cumulative total) / relative frequency polygon (freq. as a %) / line graph / moving average
• Box and whisker
(c) Information – descriptive sentences. This part is very important and often overlooked by students.
Don’t just provide numbers and statistics. Be sure to interpret them for the reader. What do the numbers
tell you? Include this with each concept / graph.
8. Conclusion
• Draw conclusions that directly relate to your thesis.
• Note any biases that you believe occurred in your study.
• Make suggestions for further/follow-up studies or any modifications that you would make to the current study.
9. Bibliography
Web sites cited using APA format.
Research Cycle: Steps to Success
Maximizing Evidence and Data Sheets (from HWDSB eBEST)
1) Develop Your Question
• Identify the area of interest (issue, concern, untested hypotheses, unanswered question, etc.).
• Define in clear, specific terms, the actual, specific problems that will be the focus of your investigation.
• Identify the variables you are interested in. What (I.V.) causes what (D.V.)?
• Identify the independent and dependent variables.
2) Gather Existing Evidence/Data on Your Topic
• You may conduct a literature search related to the problem area.
• You may examine any available data/information.
3) Make a Prediction
• Based on the literature that you have read, what is your hypothesis?
• The hypothesis should engage the two variables: a change in the I.V. affects the D.V. in some way. For example: as poverty rises, rates of attending university will fall.
• Be specific and attempt an explanation of why you think the hypothesis is true.
4) Make a Study/Evaluation Plan
Will you seek out existing data and repurpose it or create your own through a survey?
If you are using existing data…you need to
• Evaluate the quality of the source of your data
• Determine how it was collected
• Was it linked to by others? See if other sources support this source. This is done by adding “link:” to the beginning of the URL (e.g., type the following into a Google search window: link:http://www.hwdsb.on.ca)
• How was the data collected? Is the data for the two variables related?
• If you are studying the hypothesis above you cannot get poverty statistics from Alberta and university attendance rates from Ontario.
• If you were studying teen sleep and Grade 12 averages, you must use sleep data that was collected from the same students for whom you will gather grade data.
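The requirement that both variables come from the same individuals can be checked by pairing the two datasets on a shared identifier. This is a minimal Python sketch under assumed conditions: the student IDs, the values, and the `pair_by_key` helper are all hypothetical.

```python
def pair_by_key(iv_records, dv_records):
    """Pair I.V. and D.V. values only for keys present in both datasets."""
    shared = sorted(iv_records.keys() & dv_records.keys())
    return [(key, iv_records[key], dv_records[key]) for key in shared]

# Hypothetical data: hours of sleep and Grade 12 average, keyed by student ID.
sleep_hours = {"S01": 7.5, "S02": 6.0, "S03": 8.0}
grade_average = {"S02": 71, "S03": 84, "S04": 90}

pairs = pair_by_key(sleep_hours, grade_average)
print(pairs)
print(f"{len(pairs)} of {len(sleep_hours)} sleep records could be matched")
```

Only students who appear in both datasets end up in `pairs`. If very few records match, the two sources probably do not describe the same group and should not be related to each other.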
If you do a survey…you need to determine
• Who are the participants?
• What data/information will you collect?
• Are there any ethical issues to consider?
• Where will the data collection take place?
• When do you plan to collect the data? Once? More than once?
• How will you collect the data? What tools/instruments will you use?
• Do you plan to include an intervention?
• What type of analysis will you use to make sense of the findings?
5) Collect Data/Gather the Information
• Put your study/evaluation plan in place!
• Watch for confounds
6) Examine/Explore/Analyze Data
• Identify and describe the key findings (look for trends, graph the data, examine other relationships).
• Graphs should be made of summary data only.
• Graphs should always show the I.V. on the horizontal or x-axis and the D.V. on the vertical or y-axis.
• Graphs should be appropriate to the type of data (bar graphs for comparison of non-continuous information and line graphs for continuous data)
o Continuous data is data for which values exist at all levels (e.g. time)
o Non-continuous data is data such as eye colour (either blue or brown or hazel or other categories you create)
• Some important considerations:
o What were the most important findings? And why are they so important?
o To whom are the findings most relevant?
o What are the limitations of the research?
Potential Problems:
Almost EVERY problem that I have seen on final projects was because of an incomplete or poorly done proposal phase. The following factors have created flawed projects in past Data Management courses:
• Projects that were far too large in scope. A research team of 100 working for 25 years would be unable to prove causation in the way that these students wished to do. This happens most often with projects like drunk driving, teenage pregnancy or economic problems. Choose less glamorous and smaller topics that you can find data about. Make sure data is available.
• Projects which attempted to prove causation instead of correlation.
• Projects whose entire body of evidence was based on unreliable sources from the Internet. They made no attempt to figure out where their sources' data came from.
• Projects where random sampling involved giving a survey to everyone in their class. Sample size too small or too homogeneous.
• Projects where the students developed their surveys first and their research questions second. They ended up not asking the correct survey questions and were unable to prove their point.
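To contrast with the "survey everyone in the class" problem above, here is a minimal Python sketch of a simple random sample drawn from a wider sampling frame. The list of 1200 student ID numbers, the sample size of 60, and the `simple_random_sample` helper are assumptions made for illustration.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, size)

# Hypothetical sampling frame: ID numbers for 1200 students in a school.
student_ids = list(range(1, 1201))

sample = simple_random_sample(student_ids, 60, seed=2013)
print(sorted(sample)[:10])
```

Recording the seed in your Plan section lets someone else reproduce exactly which students were selected.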
6. Variables
Decide on variables
Developing a Good Research Question
Identify the variables you are going to study. Your preliminary research should allow you to develop an
hypothesis to relate two variables. One is called the independent variable and the other is a dependent
variable. Your hypothesis should be phrased such that you are studying the effect of the independent
variable on the other or dependent variable. You should attempt to find data sets that control any other
variables.
E.g. Does the number of hours children watch television per day affect levels of obesity?
Independent variable:
Dependent variable:
Controls:
Independent variable – The variable you think affects something else. You select different levels for this variable and then see how this affects another variable. You have to have an idea that the variables are related…that one causes the other to change, that they are correlated.
Dependent variable – This is the variable you would measure.
Controls – The standard to which you compare your results. Some variables need to be controlled (e.g. age/sex).
Relate them using an hypothesis:
Does A (I.V.) affect B (D.V.)? Be specific.
Define A and B fully.
7. Where can you find data?
First you will need to decide what type of data you will collect for your project:
1. Primary Data is information that you collect on your own. For example, this could be obtained by
having students at AHS complete a questionnaire on paper or online (using Fathom,
surveymonkey.com, etc.). It could also be an actual experiment/simulation that you conduct using
a computer.
2. Secondary Data is information that you are taking from another source. It is important to use reliable sources. When choosing your topic, be sure that you will be able to find good data. Some places that students often obtain data from are as follows:
Let’s go on a bit of a tour of Statscan
http://www.statcan.gc.ca/start-debut-eng.html
Home login: go to the Virtual Library, then Virtual Tools, then E-STAT
University of Michigan Library of Statistics http://www.lib.umich.edu/govdocs/stats.html
Nation Master www.nationmaster.com
http://cas.illinoisstate.edu/jpda/finding_data/internationaldata.shtml
http://ontario.compareschoolrankings.org/secondary/SchoolsByRankLocationName.aspx
Use search terms like….data sets; statistics;
Wolfram Alpha
http://www.wolframalpha.com/
Google Trends (under categories; shows you the number of Google searches)
http://www.google.com/trends/
RDC
http://www.statcan.gc.ca/rdc-cdr/data-donnee-eng.htm
Open data project
http://www.data.gc.ca/default.asp?lang=En&n=F9B7A1E3-1
UN data
http://data.un.org/
http://data.un.org/Search.aspx?q=population
Datamob
http://datamob.org/datasets
Areas of data
CDC
http://www.cdc.gov/nchs/fastats/Default.htm
http://www.cdc.gov/nchs/products/hestats.htm
Health
http://www.nlm.nih.gov/hsrinfo/datasites.html
http://phpartners.org/health_stats.html
http://web4.uwindsor.ca/units/leddy/leddy.nsf/HealthStatistics!OpenForm (list of Canadian sites for health info)
World Bank - Economic Data
http://data.worldbank.org/
http://data.worldbank.org/data-catalog
Open Data Hamilton, ON
http://openhamilton.ca/article/starting-hamiltons-open-data-sets
Freebase - gives info and data
http://www.freebase.com/
Data.gov
http://www.data.gov/
List of open datasets
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
Library and Archives Canada
http://www.collectionscanada.gc.ca/opendata/900-1000-e.html
Open Government
http://open.gc.ca/index-eng.asp
Canadian data
http://data.gc.ca/eng
Citing
Examples of Data Citations
Always check your syllabus or author guidelines to see if they contain directions for citing data. Some data distributors will suggest citations that you may use. Most common style guides do not give specific instructions for citing data; however, here is an example from one that does:
Publication Manual of the American Psychological Association (APA), 6th Edition
Pew Hispanic Center. (2004). Changing channels and crisscrossing cultures: A survey of Latinos on the news media [Data file and code book]. Retrieved from http://pewhispanic.org/datasets/
How do I cite data?
When you're writing a research paper, it is necessary to cite your use of sources, typically as footnotes at the bottom of the page or in a
bibliography at the end of the paper. It is crucial to provide references for your reader to better understand the context of your research
and to give credit for people's work that you've used. As research becomes more data-intensive, it is important to cite your use of
datasets in addition to traditional publications such as journal articles, books, and conference proceedings.
Digital datasets come in a wide variety of formats. Some examples include:
• spreadsheets
• interview transcripts
• sensor and instrument readings
• high resolution images
• gene sequences
• software source code
• video recordings
* The emerging best practice is to cite data just as you would cite a research article. *
Most traditional forms of documents are not capable of representing these kinds of data, and so datasets can be published separately
in data repositories and other web sites. Whether you produced the data yourself or you're using someone else's data in your research,
it is important to maintain a linkage between your paper and its supporting datasets by citing them. Not only does this give credit to the
person who created the data, but it enables others to reproduce your research and verify your results. In some cases, sharing a dataset
may have more scholarly impact than publishing a book or journal article.
There are many challenges in citing data. In most disciplines, there are no clear instructions on how to cite data. In fact, most of the
major style guides (APA, MLA, the Chicago Manual of Style) do not directly address the issue of data citation. Data is not recognized as
a format in many citation management tools and tutorials. Some kinds of data are dynamic, such as a weather dataset, and may
change every hour or every day, so it's difficult to know what to cite.
Here are some tips for citing data properly:
• Always look for instructions in your syllabus or the author guidelines on how to cite data. You may be able to find examples from previously published papers to imitate.
• The distributor you downloaded the dataset from may suggest a citation. Some examples include ICPSR, OECD, and Dryad.
• If there are no explicit instructions for citing data, there may be instructions for a similar format such as citation styles for electronic resources, web pages, or tables that can be used.
Try to capture these important elements in your data citation:
• Who produced the dataset (creator or author)
• The title of the dataset
• The unique identifier of the dataset, preferably a Digital Object Identifier (DOI) or minimally a link to the dataset if it is online
• The date the dataset was published and its version number, if it has one
• The date and time the dataset was accessed
• The distributor of the dataset
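One way to make sure none of these elements is forgotten is to assemble the citation from named fields. The sketch below is a hypothetical helper, loosely modelled on the APA-style Pew Hispanic Center example earlier in this handout. The function name, its parameters, and the exact punctuation are assumptions, so always check the result against your syllabus or style guide.

```python
def format_data_citation(creator, year, title, identifier,
                         description="Data file", version=None,
                         distributor=None, accessed=None):
    """Assemble citation elements into one APA-like string (illustrative only)."""
    citation = f"{creator}. ({year}). {title}"
    if version:
        citation += f" (Version {version})"
    citation += f" [{description}]."
    if distributor:
        citation += f" {distributor}."
    if accessed:
        citation += f" Retrieved {accessed} from {identifier}"
    else:
        citation += f" Retrieved from {identifier}"
    return citation

example = format_data_citation(
    creator="Pew Hispanic Center",
    year=2004,
    title=("Changing channels and crisscrossing cultures: "
           "A survey of Latinos on the news media"),
    identifier="http://pewhispanic.org/datasets/",
    description="Data file and code book",
)
print(example)
```

The optional `version`, `distributor`, and `accessed` fields correspond to the remaining elements in the list and are simply left out when they are unknown.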
Keep in mind that some datasets are dynamic and change over the course of time. Always try to cite the specific version of the dataset
that you used. Some distributors provide a checksum to ensure that the dataset hasn't been changed or corrupted since it was
published, which may be included in a source note. Other important information for understanding and using the dataset may be
included in supplementary files (e.g., codebook, readme.txt) that may be available at the same link in the citation or in the source notes
of your paper.
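When a distributor does publish a checksum, it can be compared against the file you actually downloaded. The following is a minimal Python sketch that assumes the published value is a SHA-256 hex digest. Distributors may use a different algorithm, and both helper names are hypothetical.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_checksum(path, published_hex):
    """True if the file's digest equals the checksum the distributor published."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

To use it, call `matches_published_checksum` with the path of your downloaded dataset and the checksum copied from the distributor's page. A result of `False` means the file differs from the published version, so it should be downloaded again before it is analysed or cited.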
Responsible Use of Data
Be sure to examine the license associated with the data you're citing, to make sure your use is acceptable. If the dataset is derivative of
one or more other datasets, you may need to review their licenses and credit their sources also. If you're including a substantial portion
of someone else's data in your paper, you may need to seek their permission. Some data distributors request that you submit your
citation to them to help them track the use of their data.
Is the data that you're citing accurate? Is the dataset described and presented in a way that users will recognize and use it
appropriately? Does the data contain sensitive information, such as phone numbers or other personal identity information? If your
research includes the use of human subjects, you will need to confirm that your data meets the requirements set by your institutional
review board (IRB) or other ethical norms.
Data vs. Datum
Remember: data is plural. The singular form of data is datum, which means a "data point". It sounds odd, but it is grammatically correct
to say "The data show us.." and not "The data shows us..."
Health:
-- Teen Drug Use and Abuse (Jeff Madeiros and Corey Hoekstra, St. Peter H.S., 2008)
-- Obesity and Diabetes: A Growing Epidemic (Jennifer Brierley, St. Pius X H.S., 2006)
-- Teen Pregnancy and Abortion (Nicole Sanger and Brittany Burg, St. Peter H.S., 2006)
-- Canada's Health by Region (presentation, Colin McClenaghan and Jack Wei, K.C.V.I., 2006)
-- Dietary Habits of Canadians (Bronwyn, Hillcrest H.S., 2006)
-- Factors affecting Thyroid Condition (presentation, Shafaq, Nepean H.S., 2004)
-- Does the Marital Status of Parents affect their Kids? (presentation, Cassandra, Brandon, Victoria, Derek, Mother Teresa HS, Ottawa, Dec. 2008)
Education:
-- Video game usage and Absenteeism in Canadian high schools (Bryan Smith, Orangeville D.S.S., 2008)
-- Education, Salary, and Career Paths (Patrick Jackson, North Dundas S.S., 2006)
-- Factors affecting Student Achievement (Lisa Hoople, Sacred Heart C.H.S., 2006)
-- Substance Use and Academic Performance (Charlie Berrigan, Smiths Falls C.I., 2006)
Politics and Economics:
-- Variables affecting Voter Turnout in Canadian Federal Elections (Shaun Banke, Holy Trinity H.S., 2006)
-- Political Opinions of Students (presentation, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Political Opinions of Students (dataset, J. Maier and S. Gordon, St. Peter H.S., 2006)
-- Analysis of the Canadian Federal Debt using E-STAT (Linda, Pickering H.S., 2004)
-- Factors affecting Income (Jodi Morden and Mike Curridor, Sacred Heart, 2003)
Transportation:
-- Factors that Influence Collisions (Dimitar Hristov, Vesko Avramov, and Ayman Barri, Brookfield H.S., 2006)
-- Why do Young People pay more for Auto Insurance? (T.J., Opeongo H.S., 2005)
-- Teenager Driving Infractions (Erin Knox and Katherine Renner, Sacred Heart C.H.S., 2006)
Environment and Energy:
-- Global Warming (Renpeng Sun, Nepean H.S., 2008)
-- Greenhouse Gas Emissions (Sarah Deslippe, K.C.V.I., Kingston, 2006)
-- Carbon Dioxide Emissions (Mathew Hall, Dr. Williams S.S., 2006)
-- The Future of Electricity (Jonathan Thomas, Opeongo H.S., 2006)
-- Trends in Energy Consumption in Canada (Chenyu Bing, Sir Robert Borden H.S., 2006)
Other:
-- Unemployment & Divorce in Canada (Rachel Wang, Glebe C.I., 2008)
-- The Effect of Tourism on GDP in Canada (Feifei C., L'Amoreaux C.I., 2007)
-- Travellers in Canada (Gosia, Sacred Heart C.H.S., 2005)
-- Investigation on the effects of Data Manipulation (Matt, Will, and James, Carleton Place H.S., 2005)
-- Factors affecting Internet Use (Bryan W., Earl of March S.S., 2004)