revised - BioQUEST Curriculum Consortium

Simple Database Construction Using
Local Sources Of Data
Dr. D. Timothy Gerber
Associate Professor
Biology Department, Cowley Hall
University of Wisconsin - La Crosse
1725 State Street, La Crosse, WI 54601
Email: gerber.dani@uwlax.edu
Phone: 608.785.6977 (office), 608.785.6959 (fax), 608.781.5824 (home)
Dr. David M. Reineke
Assistant Professor
Mathematics Department, Cowley Hall
University of Wisconsin - La Crosse
1725 State Street, La Crosse, WI 54601
Email: reineke.davi@uwlax.edu
Phone: 608.785.6607 (office), 608.785.6602 (fax), 608.779.5603 (home)
Word count: 3,151
"Information is all around us – often in such great quantities that we are
unable to make sense of it. A set of data can be represented by a few
summary characteristics that may reveal or conceal important aspects
of it. Statistics is a form of mathematics that develops useful ways for
organizing and analyzing large amounts of data." AAAS (1990, p. 137)
Abstract: With the increased accessibility of information in our society, databases have
become a common way to organize and distribute data. To best understand how
information is organized in a database, students need to see firsthand how they are
constructed. Construction of three simple databases using a spreadsheet is described here
and basic summary statistics are provided for each. Recommendations are given for building
simple databases in a computer spreadsheet and for statistically analyzing their
data.
Introduction: As in the above quote, information is truly all around us. Many of the
informed decisions scientists, government officials, industrial analysts and others in our
society make revolve around summarizing information amassed in large sets of data or
databases. In biology, this information can range from human health (Kaiser, 2002) and
genetic data (FIDD, 1999) to species descriptions (e.g., Guiry & Nic Dhonncha, 2002) to
quantitative limnological data (e.g., NODC, 2001). Readily amenable to manipulation by
computer and highly organized, databases and statistics are excellent tools for analyzing
and summarizing quantitative measurements and are important for scientific
interpretation/explanations of natural phenomena (NRC, 1996, p. 118).
Since measurement is such an integral part of data collection in many biological fields
and computers/software have become increasingly available, organizing quantitative
information in databases for statistical analysis seems a reasonable way to integrate math
and science (AAAS, 1990, p. 212) in the classroom. Equally, information in one
database can be used collaboratively between math and science classrooms (NRC, 2000,
p. 141).
Unfortunately, while databases can offer incredible amounts of raw, meta-,
and/or summarized data, using a database as an introduction to information processing
can be daunting, confusing, and even a ‘turn off.’ A less daunting way to introduce
students to databases is to have them build their own.
Student-built databases using local information sources can serve several functions in a
biology, math or combined classroom(s). (1) The actual process of collection and
manipulation of data allows students to internalize or give meaning to the numbers in a
database. In essence, they better acquire a ‘feel for the data.’ (2) Using local
information can be an interesting lesson. Databases and the statistics generated from
them seem sterile and objective when viewed in a textbook or downloaded from a
website. They are anything but that when one understands how they are constructed,
where the data come from, and the assumptions behind their construction and
calculations. (3) Databases can be included as part of a spiraling curriculum.
A dynamic database constructed at a lower grade level can be extended and manipulated at
higher grade levels with increasing sophistication. (4)
Student-built databases connect science and math with integrated levels of understanding.
A database could literally be used in both science and math classes at one grade level or
potentially in these classes from elementary to high school.
Following the idea that “[s]ound teaching usually begins with questions and phenomena
that are interesting and familiar to students, not with abstractions or phenomena outside
their range of perception, understanding, or knowledge” (AAAS, 1990, p. 201), the
purpose of this paper is to describe how a simple database can be easily constructed from
quantitative data using a computer-generated spreadsheet. Vital statistics (e.g., natality,
mortality) data collected from a local newspaper can easily be statistically summarized and
graphically displayed using commonly available software. While there are papers on the
use of existing databases (e.g., Putterbaugh & Burleigh, 2001; LaBare et al., 2000;
Capelle & Smith, 1998), few
address the basic construction of a database using student-collected information. To best
understand the value of a database, students need to understand how one is constructed.
Settings:
The authors have used these databases in college biology (Bio 103, Introductory Biology,
non-majors) and mathematics (Mth 205, Elementary Statistics) courses to teach
biological and statistical concepts. As courses that fulfill general science and math
requirements on our campus, Bio 103 and Mth 205 are taken by students with wide
ranging educational backgrounds. In addition, many K-12 pre-service teachers take these
courses. Computer facilities are available for student use.
Information for databases used in Bio 103 was collected (see details in Methods below)
by students early in the semester and emailed to the instructor for inclusion in one large
master database or were instructor generated. After completion, the master database was
emailed back to each student as an attachment. Basic statistical data manipulation using
the master database is part of the lecture component in Bio 103. The constructed
databases are shared with Mth 205 students.
Methods:
Three separate databases for (1) Baby (all infants), (2) Twin babies and (3) Obituary data
were constructed from information in the La Crosse Tribune (local newspaper). One of
our local hospitals now has baby information online (see Gundersen-Lutheran in
reference section). Length, weight (continuous variables), sex (categorical variable), date
and time of birth data were published every other week for all babies (twins are identified) born
at Gundersen Lutheran Hospital, La Crosse, WI. Sex, birth and death data, published
daily, were collected from the obituary column. No names were recorded in building
the Baby (all infants) or Obituary databases; however, surnames were used to keep track of
twins. Care was taken not to double count the same person in the obituary column since
a person is usually listed on two successive days.
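The record layout described above can be sketched in a few lines of Python as an alternative to a spreadsheet. This is a minimal sketch only: the column names are hypothetical (the exact spreadsheet headers are not listed here), the rows are made up for illustration and are not newspaper data, and the units (ounces, inches) are assumed from the figure axes.

```python
import csv
import io

# Hypothetical column names chosen for this sketch.
BABY_FIELDS = ["sex", "weight_oz", "length_in", "birth_date", "birth_time", "twin"]
OBIT_FIELDS = ["sex", "yob", "yod"]

# Made-up example rows (illustration only, not actual newspaper data).
baby_rows = [
    {"sex": "F", "weight_oz": 118, "length_in": 20.0,
     "birth_date": "2002-03-02", "birth_time": "14:05", "twin": "no"},
]
obit_rows = [
    {"sex": "F", "yob": 1915, "yod": 2002},
    {"sex": "M", "yob": 1930, "yod": 2002},
]


def add_age(rows):
    """Return copies of the rows with a calculated column: age = YOD - YOB."""
    return [{**row, "age": row["yod"] - row["yob"]} for row in rows]


def to_csv(rows, fields):
    """Serialize rows to CSV text, which any spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


obits = add_age(obit_rows)
print(to_csv(baby_rows, BABY_FIELDS))
print(to_csv(obits, OBIT_FIELDS + ["age"]))
```

Because only years are recorded, an age calculated as year of death minus year of birth can differ from the true age at death by one year, which is itself a useful point for class discussion.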
Newspaper data were entered into a Microsoft Excel 2000 (hereafter Excel) spreadsheet.
Excel was chosen because of its ubiquity (included with the Office 2000 suite of
programs) and basic statistical analysis capability. For those unfamiliar with Excel, use
the “Help” pull-down menu within the program or consult a general reference (e.g., Shelly
Cashman Series, 2000). Basic database structure and terminology can be found in
Spooner & Barracato (1999). Graphs were generated using SPSS, an easy-to-use menu-driven software package for statistical processing, which accepts Excel spreadsheets
(SPSS, 2002). The Sampling tool in the Excel Analysis ToolPak was used to draw
random samples from database “populations” and to compute descriptive statistics and
confidence intervals.
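For readers without the Excel add-in, the same steps (draw a random sample from the database "population," then compute descriptive statistics and a confidence interval) might look like the sketch below. The population here is synthetic, generated with arbitrary parameters purely for illustration. The sample is drawn without replacement and the interval uses a large-sample normal approximation; both are assumptions of this sketch and may not match what Excel does.

```python
import math
import random
import statistics

# Synthetic stand-in for a database "population" (arbitrary parameters,
# NOT the La Crosse data): 500 birth weights in ounces.
_pop_rng = random.Random(0)
population = [_pop_rng.gauss(120, 18) for _ in range(500)]


def describe_sample(population, n, seed=None, confidence=0.95):
    """Draw a simple random sample of size n and summarize it."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)      # without replacement (assumption)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)           # sample standard deviation (n - 1)
    # Normal critical value; a t critical value would be slightly larger
    # for small samples.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


result = describe_sample(population, n=30, seed=1)
print(result)
print("population mean:", statistics.mean(population))
```

Comparing the printed interval with the known population mean gives students a concrete check on what a confidence interval does and does not promise.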
Results:
Each of the three databases (Baby (all infants), Twins, Obituary) was easy to generate
even with only a rudimentary knowledge of Excel (see Fig. 1 for basic database setup).
However, 10-15 Bio 103 students usually needed additional help in using Excel and
attaching email files. This problem was quickly solved with one instructor-led, ‘hands-on’ computer lab session (approx. 1 hour) on entering data into Excel and a discussion of
attaching files to email messages. The Baby (all infants) and Obituary databases grew
at a rate of approximately 70-100 babies or deaths/month, so a month of collection yields
a good sample of human birth weight (Baby database) or age at death (Obituary database). In our local newspaper, births are
listed once every other Saturday. The Twins database took much longer to develop since
only 0-5 sets of twins were listed monthly; it takes several semesters to accumulate a
useful number of twin records.
There were slight, but not statistically significant, differences in birth weight between
males and females and a few outliers were discovered, as shown by the boxplots in
Figure 2. Overall birth weight (Fig. 2) was similar to what is found in much larger data
sets for the United States (e.g., Wilcox et al., 1995). Birth weight, birth length, and
length of human life span (vital statistics) were excellent measures to use for database
building and basic statistical analysis for several reasons. (1) Biologically, vital statistics
convey important information concerning the human condition. For example, human
birth weight is associated with individual infant survival and a population’s infant
mortality (Wilcox, 2001). (2) Statistically, these measures often show a normal or bell-shaped distribution (Wilcox, 2001), important for assumptions of parametric statistical
tests. (3) From an educational view, even young students should be familiar with or can
easily understand what these measures are and how they are determined. There is also a
strong positive correlation between baby birth weight and length (Fig. 3), which can be
used to introduce correlation and regression using a student-generated database.
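As a worked illustration of the correlation and regression ideas mentioned above, the following sketch computes Pearson's r and a least-squares line from first principles. The length and weight values are made up for illustration and are not taken from the Baby database.

```python
import math


def _sums(xs, ys):
    """Return the means and the sums of squares/cross-products used below."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    mean_x, mean_y, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up example values (illustration only).
lengths_in = [18.0, 19.0, 19.5, 20.0, 20.5, 21.0, 22.0]
weights_oz = [92, 104, 110, 118, 121, 130, 142]

r = pearson_r(lengths_in, weights_oz)
slope, intercept = least_squares(lengths_in, weights_oz)
print(f"r = {r:.3f}; weight = {intercept:.1f} + {slope:.1f} * length")
```

Writing out the sums by hand, rather than calling a library routine, keeps the arithmetic visible for students meeting these formulas for the first time.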
Significant differences in weight between single births and twin births for males and
females are displayed clearly by the graphic comparison shown in Figure 4. Differences
in human life expectancy by sex also provide a nice graphic comparison (Fig. 5) using
our obituary database and can be compared with what students know about average life
expectancy. This database can be used to discuss statistical calculations based on a
population and samples of various sizes (Fig. 6). Data were graphically represented
using boxplots (Fig. 2), scatterplots (Fig. 3), histograms (Fig. 5) and confidence intervals
(Figs. 2 & 4).
Discussion:
“Using data from actual investigations from science in mathematics courses, students
encounter all the anomalies of authentic problems – inconsistencies, outliers, and errors –
which they might not encounter with contrived textbook data.” (NRC, 1996, p. 214)
Creating their own database is an excellent way for students to learn the trials and
tribulations of data collection and data management. It provides an opportunity to
discuss ethical issues in data collection as well as data integrity. Furthermore, students
will see that data in the “real world” do not always present themselves as neatly as they appear
in textbooks or web-based databases, but need to be organized, carefully labeled,
and proofread. Sometimes part of a data record may be missing or recorded incorrectly,
giving rise to unusually large or small values. In these situations, students should be
taught the difference between an outlier and a data entry error: genuine
data entry errors are to be corrected (where correction is possible) or removed from the
database, whereas outliers are to remain and be dealt with appropriately.
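One way to make this distinction concrete is to flag unusual values automatically and then check each flagged value against its source. The sketch below uses the common boxplot convention of flagging values more than 1.5 interquartile ranges beyond the quartiles; this rule and the quartile method are assumptions of the sketch, and statistical packages may compute quartiles slightly differently. The weights are made up, with 1180 standing in for a hypothetical mistyped 118 and 52 for a genuinely small newborn.

```python
import statistics


def flag_unusual(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up birth weights in ounces (illustration only).
weights_oz = [112, 98, 1180, 120, 105, 131, 52, 118, 110, 126, 115, 123]

for value in flag_unusual(weights_oz):
    # The rule only flags candidates; deciding whether a value is an entry
    # error or a legitimate outlier requires going back to the source.
    print("check against the newspaper listing:", value)
```

The rule cannot tell the two flagged values apart. Only a return to the original listing shows that one is a typing mistake to be corrected and the other a real observation to be kept.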
A database can also be used to illustrate the concepts of population and sample. For
example, the entire database can be defined to be a hypothetical population of interest
and a random sample of a given size can be drawn from it, as shown in Figure 6. The
descriptive statistics from the sample can then be compared to the corresponding
population parameters. Repeated sampling can be used to demonstrate the variability of
sample statistics, which may be followed up by a discussion of sampling distribution
theory. This can easily be done in Excel using the Sampling tool in the Analysis
ToolPak. Naturally, such a discussion would lead to statistical inference for students in
grades 9–12 or in a university-level elementary statistics course. Constructing
confidence intervals and conducting hypothesis testing using random samples from a
database affords students the rare opportunity of having complete knowledge of the
population from which the sample came.
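The repeated-sampling demonstration described above can also be simulated outside a spreadsheet. In this sketch the population is synthetic (arbitrary parameters, not the actual database), samples are drawn without replacement, and the 95% intervals use a normal approximation; all three are assumptions made for illustration.

```python
import math
import random
import statistics

# Synthetic stand-in for a fully known database "population".
_pop_rng = random.Random(42)
population = [_pop_rng.gauss(75, 15) for _ in range(2000)]


def sampling_experiment(population, n, repeats, seed=0):
    """Draw many samples; record each sample mean and whether its
    approximate 95% confidence interval covers the population mean."""
    rng = random.Random(seed)
    mu = statistics.mean(population)
    z = statistics.NormalDist().inv_cdf(0.975)
    means, covered = [], 0
    for _ in range(repeats):
        sample = rng.sample(population, n)
        m = statistics.mean(sample)
        half_width = z * statistics.stdev(sample) / math.sqrt(n)
        means.append(m)
        if m - half_width <= mu <= m + half_width:
            covered += 1
    return mu, means, covered / repeats


mu, means, coverage = sampling_experiment(population, n=30, repeats=1000)
print("population mean:          ", round(mu, 2))
print("mean of sample means:     ", round(statistics.mean(means), 2))
print("SD of sample means:       ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):          ", round(statistics.pstdev(population) / math.sqrt(30), 2))
print("share of CIs covering mu: ", coverage)
```

Students can compare the spread of the sample means with sigma divided by the square root of n, and see that only some of the intervals capture the known population mean.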
Biologically, most populations that researchers are interested in studying are so large that
it is not possible to have complete knowledge of them. This makes clear the
necessity of statistics as a discipline and the need to account for and understand the random
variation that occurs with random sampling. Using vital statistics to build a database
provides students an opportunity to investigate, discover, and collect “real” data using
biologically important measures they can understand. Building a simple database with
student-collected data offers an excellent opportunity to connect the biological with the
mathematical and fosters collaboration between students as well as faculty.
Databases in the Classroom:
“To take hold and mature, concepts must not just be presented to students from time to
time but must be offered to them periodically in different contexts and at increasing
levels of sophistication.” (AAAS, 1990, p. 207)
At the K-12 level, simple databases can easily be constructed using spreadsheet software
(e.g., Excel), a calculator with a spreadsheet function (Morgan, 1997), or as a pencil and
paper exercise. Classroom-generated databases can easily be compared with trends for
the nation, too (see NCHS, 2003). In addition, many of the education standards for data
analysis and probability for grades 3-12 can be addressed through the assembly and use
of databases (Table 1). While the sophistication of statistical analysis will vary
drastically from lower grades to the college level, database construction and data
summarization offer the opportunity to use these exercises throughout much of the formal
educational training a student receives.
Regardless of grade level, several words of caution should be mentioned. (1) When
using a local information source, students may know people in the databases they are
constructing. This may be an advantage if, for example, a student has a new baby sister
listed in the birth announcement section of the newspaper and her information is included
in a database. However, it may be devastating for a student, whose uncle was just killed
in a car accident and is now listed in the obituary section, to include him in a database.
(2) Database construction in a classroom will not necessarily be easy. Missing data or
measurement problems of some sort are likely to be encountered. Such situations can be
exploited to teach students that data collection is often “messy” and that it is essential to
be as careful and accurate as possible. Another pitfall is the tendency to view the
database as a “random sample” when that is not likely to be the case. Instructors will
want to be careful to define exactly what the database represents, which is more likely to
be a well-defined population than a random sample. This point has more relevance for
secondary and university-level students covering statistical inference procedures because
these procedures require that samples be randomly selected.
(3) Database construction will be time-consuming for both students and instructor, especially when first beginning. We
recommend starting with a small, easily controlled but meaningful data set. Complexity
can be built into databases over time.
You may request the three databases we have developed by emailing the first author.
When emailing, please include your name, institution/school, city, state/province, and
country so that we may keep a record of requests. Databases used for this publication
will be emailed to you as attached Excel files. Included in our databases are the compiled
Baby (all infants), Twin babies, and Obituary raw data collected from the La Crosse
Tribune. These databases may be freely used for educational purposes; however, it is
suggested that they be used as examples. It is preferable to build your databases using
student-collected data. The data were not double-checked for accuracy.
Acknowledgements: The authors thank L. Gerber and two anonymous reviewers for
comments on the original manuscript.
References:
American Association for the Advancement of Science (AAAS). (1990). Science for All
Americans. New York: Oxford University Press.
Capelle, J. & M. Smith. (1998). Using cemetery data to teach population biology &
local history. The American Biology Teacher 60: 690-693.
Frequency of Inherited Disorders Database (FIDD) (1999).
http://archive.uwcm.ac.uk/uwcm/mg/fidd/index.html
Guiry, M. D. & Nic Dhonncha, E. (2002). AlgaeBase.
http://www.algaebase.org/default.html
Gundersen-Lutheran Hospital’s On-Line Nursery (http://www.gundluth.org/babies)
Kaiser, J. (2002). Population databases boom, from Iceland to the U.S. Science
298(5596): 1158-1161.
LaBare, K., R. Klotz, & E. Witherow. (2000). Using online databases to teach
ecological concepts. The American Biology Teacher 62(2): 124-127.
Morgan, L. (1997). Explorations: Statistics Handbook for the TI-83. Texas Instruments
Inc.
National Center for Health Statistics (NCHS). (2003). http://www.cdc.gov/nchs/
National Council of Teachers of Mathematics (NCTM). (2000). Principles and
Standards for School Mathematics. Reston, VA: The National Council of Teachers of
Mathematics, Inc.
National Research Council (NRC). (1996). National Science Education Standards.
Washington D. C.: National Academy Press.
-----. (2000). Inquiry and the National Science Education Standards: A Guide for
Teaching and Learning. Washington D. C.: National Academy Press.
National Oceanographic Data Center (NODC) (2001) http://www.nodc.noaa.gov/
Putterbaugh, M. & J. Burleigh. (2001). Investigating evolutionary questions using
online molecular databases. The American Biology Teacher 6: 422-431.
Shelly Cashman Series. (2000). Microsoft Office 2000: Introductory concepts and
techniques. Cambridge, MA: Course Technology.
Spooner, B. & J. Barracato. (1999). Database Basics Skills Book. Arlington, VA:
National Science Teachers Association.
SPSS for Windows, Rel. 11.5.1. 2002. Chicago: SPSS Inc.
Wilcox, A.J. (2001). On the importance – and the unimportance – of birthweight.
International Journal of Epidemiology 30: 1233-1241. Online at:
http://eb.niehs.nih.gov/bwt/V0M3QDQU.pdf
-----, R. Skjaerven, P. Buekens, & J. Kiely. (1995). Birth weight and perinatal mortality:
A comparison of the United States and Norway. Journal of the American Medical
Association 273: 709-711.
Table 1. Selected science* (NRC, 1996) and math+ (NCTM, 2000) standards relevant to
this activity for K-12 grade levels.
Grade   Standard
3-5     Collect data using observations, surveys and experiments (p. 176)+
5-8     …tools and techniques to gather, analyze, and interpret data (p. 145)*
5-8     Nature of science (p. 170)*
6-8     Find, use, and interpret measures of center and spread, including mean
        and interquartile range (p. 248)+
6-8     Discuss and understand the correspondence between data sets and their
        graphical representations, especially histograms, stem-and-leaf plots,
        box plots, and scatterplots (p. 248)+
6-8     Use observations about differences between two or more samples to
        make conjectures about the populations from which the samples were
        taken (p. 248)+
9-12    Use technology and mathematics to improve investigations and
        communications (p. 175)*
9-12    Understandings about scientific inquiry (p. 176)*
9-12    Understand the meaning of measurement data and categorical data, of
        univariate and bivariate data, and of the term variable (p. 324)+
9-12    Understand how sample statistics reflect the values of population
        parameters and use sampling distributions as the basis for informal
        inference (p. 324)+
Captions for Figures
Figure 1. Example of the spreadsheet for the database of newborns (Baby (all infants)
database) in La Crosse, WI.
Figure 2. Boxplots of birth weight for newborns in La Crosse, WI. Circles represent
outliers in the data.
Figure 3. Scatterplot of birth weight vs. length for newborns in La Crosse, WI.
Figure 4. Mean birth weight in ounces of male and female newborn twins and singles
with 95% confidence intervals for newborns in La Crosse, WI.
Figure 5. Histograms for ages of males and females using the Obituary database.
Fig. 6. Example of an Excel spreadsheet with both raw (sex, year of death (YOD), year
of birth (YOB)) and calculated (age = YOD - YOB) obituary data and an embedded table
of statistics.
[Figure 2. Boxplots titled "Newborns in La Crosse, WI": Weight (oz.) by SEX (Female, N = 56; Male, N = 69).]
[Figure 3. Scatterplot titled "Newborns in La Crosse, WI": Weight (oz.) vs. Length (in.), with points marked by SEX (Male, Female).]
[Figure 4. Plot by Sex (Female, Male) and TYPE (Single, Twin); N values shown: 56, 28, 69, 22.]
[Figure 5. Histograms of Frequency vs. AGE (Years), in two panels: SEX = Female and SEX = Male.]
Verification
This is to verify that our manuscript is neither being nor has been accepted for
publication elsewhere.
D. Timothy Gerber
________________________________________
David M. Reineke
________________________________________