Simple Database Construction Using Local Sources Of Data

Dr. D. Timothy Gerber
Associate Professor
Biology Department, Cowley Hall
University of Wisconsin - La Crosse
1725 State Street, La Crosse, WI 54601
Email: gerber.dani@uwlax.edu
Phone: 608.785.6977 (office), 608.785.6959 (fax), 608.781.5824 (home)

Dr. David M. Reineke
Assistant Professor
Mathematics Department, Cowley Hall
University of Wisconsin - La Crosse
1725 State Street, La Crosse, WI 54601
Email: reineke.davi@uwlax.edu
Phone: 608.785.6607 (office), 608.785.6602 (fax), 608.779.5603 (home)

Word count: 3,151

"Information is all around us – often in such great quantities that we are unable to make sense of it. A set of data can be represented by a few summary characteristics that may reveal or conceal important aspects of it. Statistics is a form of mathematics that develops useful ways for organizing and analyzing large amounts of data." AAAS (1990, p. 137)

Abstract: With the increased accessibility of information in our society, databases have become a common way to organize and distribute data. To best understand how information is organized in a database, students need to see firsthand how databases are constructed. Construction of three simple databases using a spreadsheet is described here and basic summary statistics are provided for each. Recommendations for building simple databases using a computer spreadsheet and for the statistical analysis of their data are given.

Introduction: As in the above quote, information is truly all around us. Many of the informed decisions scientists, government officials, industrial analysts and others in our society make revolve around summarizing information amassed in large sets of data or databases. In biology, this information can range from human health (Kaiser, 2002) and genetic data (FIDD, 1999) to species descriptions (e.g., Guiry & Nic Dhonncha, 2002) to quantitative limnological data (e.g., NODC, 2001).
Readily amenable to manipulation by computer and highly organized, databases and statistics are excellent tools for analyzing and summarizing quantitative measurements and are important for scientific interpretation and explanation of natural phenomena (NRC, 1996, p. 118). Since measurement is such an integral part of data collection in many biological fields and computers/software have become increasingly available, organizing quantitative information in databases for statistical analysis seems a reasonable way to integrate math and science (AAAS, 1990, p. 212) in the classroom. Equally, information in one database can be used collaboratively between math and science classrooms (NRC, 2000, p. 141).

Unfortunately, while databases can offer incredible amounts of raw, meta-, and/or summarized data, using a database as an introduction to information processing can be daunting, confusing, and even a ‘turn off.’ A less daunting way to introduce students to databases is to have them build their own. Student-built databases using local information sources can serve several functions in a biology, math or combined classroom(s).
(1) The actual process of collection and manipulation of data allows students to internalize or give meaning to the numbers in a database. In essence, they better acquire a ‘feel for the data.’
(2) Using local information can be an interesting lesson. Databases and the statistics generated from them seem sterile and objective when viewed in a textbook or downloaded from a website. They are anything but that when one understands how they are constructed, where the data come from, and the assumptions behind their construction and calculations.
(3) Databases can be included as part of a spiraling curriculum. Constructed at a lower grade level, additions and manipulation of information to a dynamic database can be used at higher grade levels with increasing sophistication.
(4) Student-built databases connect science and math with integrated levels of understanding. A database could literally be used in both science and math classes at one grade level or potentially in these classes from elementary to high school.

Following the idea that “[s]ound teaching usually begins with questions and phenomena that are interesting and familiar to students, not with abstractions or phenomena outside their range of perception, understanding, or knowledge” (AAAS, 1990, p. 201), the purpose of this paper is to describe how a simple database can be easily constructed from quantitative data using a computer-generated spreadsheet. Constructed using quantitative information collected from a local newspaper, vital statistics (e.g., natality, mortality) data can easily be statistically summarized and graphically displayed using commonly available software. While there are papers on the use of existing databases (e.g., Putterbaugh & Burleigh, 2001; LaBare et al., 2000; Capelle & Smith, 1998), few address the basic construction of a database using student-collected information. To best understand the value of a database, students need to understand how databases are constructed.

Settings: The authors have used these databases in college biology (Bio 103, Introductory Biology, non-majors) and mathematics (Mth 205, Elementary Statistics) courses to teach biological and statistical concepts. As courses that fulfill general science and math requirements on our campus, Bio 103 and Mth 205 are taken by students with wide-ranging educational backgrounds. In addition, many K-12 pre-service teachers take these courses. Computer facilities are available for student use. Information for databases used in Bio 103 was collected (see details in Methods below) by students early in the semester and emailed to the instructor for inclusion in one large master database or was instructor generated.
After completion, the master database was emailed back to each student as an attachment. Basic statistical data manipulation using the master database is part of the lecture component in Bio 103. The constructed databases are shared with Mth 205 students.

Methods: Three separate databases for (1) Baby (all infants), (2) Twin babies and (3) Obituary data were constructed from information in the La Crosse Tribune (local newspaper). One of our local hospitals now has baby information online (see Gundersen-Lutheran in reference section). Length, weight (continuous variables), sex (categorical variable), date and time of birth data were published every other week for all babies (twins are identified) born at Gundersen Lutheran Hospital, La Crosse, WI. Sex, birth and death data, published daily, were collected from the obituary column. No names were recorded in building the Baby (all infants) or Obituary databases; however, surnames were used to keep track of twins. Care was taken not to double count the same person in the obituary column since a person is usually listed two successive days.

Newspaper data were entered into a Microsoft Excel 2000 (hereafter Excel) spreadsheet. Excel was chosen because of its ubiquity (included with the Office 2000 suite of programs) and basic statistics analysis capability. For those unfamiliar with Excel, use the “Help” pull-down bar within the program or consult a general reference (e.g., Shelly Cashman Series, 2000). Basic database structure and terminology can be found in Spooner & Barracato (1999). Graphs were generated using SPSS, an easy-to-use menu-driven software package for statistical processing, which accepts Excel spreadsheets (SPSS, 2002). The sampling tool in the Excel Analysis Toolpak™ was used to draw random samples from database “populations” and to compute descriptive statistics and confidence intervals.
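For readers who prefer a scripting language to the Excel tools, the sampling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the "population" is made-up data rather than the Obituary database, the function names are hypothetical, sampling is done without replacement, and the interval uses the normal critical value 1.96 where a t critical value would be more appropriate for small samples.

```python
import math
import random
import statistics


def draw_sample(population, n, seed=None):
    """Draw a simple random sample of size n without replacement (an assumption)."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def mean_ci_95(sample):
    """Return (mean, lower, upper) for an approximate 95% confidence interval.

    Assumption: uses z = 1.96 for simplicity; a t critical value
    would be more appropriate for small samples.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = 1.96 * standard_error
    return mean, mean - margin, mean + margin


# Hypothetical "population" of ages at death; NOT the paper's data.
rng = random.Random(0)
population_ages = [rng.randint(40, 100) for _ in range(500)]

sample = draw_sample(population_ages, 30, seed=1)
mean, lower, upper = mean_ci_95(sample)
population_mean = statistics.mean(population_ages)

print(f"population mean = {population_mean:.1f}")
print(f"sample mean = {mean:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Changing the seed and rerunning gives a different sample each time, which is one way to let students see sample statistics vary around a known population value.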
Results: Each of the three databases (Baby (all infants), Twins, Obituary) was easy to generate even with only a rudimentary knowledge of Excel (see Fig. 1 for basic database setup). However, 10-15 Bio 103 students usually needed additional help in using Excel and attaching email files. This problem was quickly solved with one instructor-led, ‘hands-on’ computer lab session (approx. 1 hour) on entering data into Excel and a discussion of attaching files to email messages.

The Baby (all infants) and Obituary databases grew at a rate of approximately 70-100 babies or deaths per month and, within a month’s time, can provide a good sample (e.g., of human birth weight). In our local newspaper, births are listed once every other Saturday. The Twins database took much longer to develop since only 0-5 sets of twins were listed monthly. This database takes semesters to develop, far longer than the Baby (all infants) or Obituary databases.

There were slight, but not statistically significant, differences in birth weight between males and females and a few outliers were discovered, as shown by the boxplots in Figure 2. Overall birth weight (Fig. 2) was similar to what is found in much larger data sets for the United States (e.g., Wilcox et al., 1995). Birth weight, birth length, and length of human life span (vital statistics) were excellent measures to use for database building and basic statistical analysis for several reasons.
(1) Biologically, vital statistics convey important information concerning the human condition. For example, human birth weight is associated with individual infant survival and a population’s infant mortality (Wilcox, 2001).
(2) Statistically, these measures often show a normal or bell-shaped distribution (Wilcox, 2001), important for assumptions of parametric statistical tests.
(3) From an educational view, even young students should be familiar with or can easily understand what these measures are and how they are determined.
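The by-sex comparison of birth weight reported above could also be reproduced outside a spreadsheet. The following is a minimal sketch, assuming each record holds sex, weight in ounces, and length in inches; the records shown are invented for illustration and are not taken from the Baby (all infants) database, and the function name is hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (sex, weight in oz., length in in.); NOT the paper's data.
records = [
    ("F", 118, 19.5),
    ("M", 125, 20.0),
    ("F", 110, 19.0),
    ("M", 131, 21.0),
    ("F", 122, 20.5),
    ("M", 119, 20.0),
]


def summarize_by_sex(rows):
    """Group weights by sex and return basic descriptive statistics per group."""
    groups = defaultdict(list)
    for sex, weight_oz, _length_in in rows:
        groups[sex].append(weight_oz)
    return {
        sex: {
            "n": len(weights),
            "mean": statistics.mean(weights),
            "median": statistics.median(weights),
            "stdev": statistics.stdev(weights),
        }
        for sex, weights in groups.items()
    }


summary = summarize_by_sex(records)
for sex, stats in sorted(summary.items()):
    print(sex, stats)
```

Descriptive summaries like these only describe the groups; judging whether a difference is statistically significant still requires a formal test, which is not shown here.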
There is also a strong positive correlation between baby birth weight and length (Fig. 3), which can be used to introduce correlation and regression using a student-generated database. Significant differences in weight between single births and twin births for males and females are displayed clearly by the graphic comparison shown in Figure 4. Differences in human life expectancy by sex also provide a nice graphic comparison (Fig. 5) using our Obituary database and can be compared with what students know about average life expectancy. This database can be used to discuss statistical calculations based on a population and samples of various sizes (Fig. 6). Data were graphically represented using boxplots (Fig. 2), scatterplots (Fig. 3), histograms (Fig. 5) and confidence intervals (Figs. 2 & 4).

Discussion: “Using data from actual investigations from science in mathematics courses, students encounter all the anomalies of authentic problems – inconsistencies, outliers, and errors – which they might not encounter with contrived textbook data.” (NRC, 1996, p. 214)

Creating their own database is an excellent way for students to learn the trials and tribulations of data collection and data management. It provides an opportunity to discuss ethical issues in data collection as well as data integrity. Furthermore, students will see that data in the “real world” do not always present themselves as neatly as they appear in textbooks or web-based databases, but need to be organized, carefully labeled, and proofread. Sometimes part of a data record may be missing or recorded incorrectly, giving rise to unusually large or small values. In these situations, students should be taught the difference between an outlier and a data entry error. That is, genuine data entry errors are to be corrected (where correction is possible) or removed from the database, whereas outliers are to remain and be dealt with appropriately.
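The distinction between a data entry error and an outlier can be made concrete with a short sketch. It rests on two assumptions that are not from the paper: a hypothetical "plausible range" for birth weight in ounces, used to catch entry errors, and the common boxplot convention of flagging values more than 1.5 interquartile ranges beyond the quartiles. The weights are invented for illustration.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def triage(values, plausible_min, plausible_max):
    """Separate likely entry errors (impossible values) from outliers (unusual values).

    Entry errors are set aside first so they do not distort the fences.
    """
    entry_errors = [v for v in values if not plausible_min <= v <= plausible_max]
    kept = [v for v in values if plausible_min <= v <= plausible_max]
    low, high = iqr_fences(kept)
    outliers = [v for v in kept if v < low or v > high]
    return {"entry_error": entry_errors, "outlier": outliers}


# Hypothetical weights in oz.; 1180 stands in for a typing slip such as 118 -> 1180.
weights_oz = [112, 118, 120, 121, 123, 125, 126, 128, 130, 170, 1180]

# The plausible range (10 to 250 oz.) is an illustrative assumption.
report = triage(weights_oz, plausible_min=10, plausible_max=250)
print(report)
```

A value in the entry_error list would be checked against the newspaper and corrected or removed, while a value in the outlier list stays in the database and becomes a topic for discussion.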
A database can also be used to illustrate the concepts of population and sample. For example, the entire database can be defined to be a hypothetical population of interest and a random sample of a given size can be drawn from it, as shown in Figure 6. The descriptive statistics from the sample can then be compared to the corresponding population parameters. Repeated sampling can be used to demonstrate the variability of sample statistics, which may be followed up by a discussion of sampling distribution theory. This can easily be done in Excel using the Sampling tool in the Analysis Toolpak™. Naturally, such a discussion would lead to statistical inference for students in grades 9–12 or in a university-level elementary statistics course. Constructing confidence intervals and conducting hypothesis testing using random samples from a database affords students the rare opportunity of having complete knowledge of the population from which the sample came. Biologically, most populations that researchers are interested in studying are so large that it is not possible to have complete knowledge of them, which makes clear the necessity of statistics as a discipline and the need to account for and understand the random variation that occurs with random sampling.

Using vital statistics to build a database provides students an opportunity to investigate, discover, and collect “real” data using biologically important measures they can understand. Building a simple database with student-collected data offers an excellent opportunity to connect the biological with the mathematical and produces collaboration between students as well as faculty.

Databases in the Classroom: “To take hold and mature, concepts must not just be presented to students from time to time but must be offered to them periodically in different contexts and at increasing levels of sophistication.” (AAAS, 1990, p.
207)

At the K-12 level, simple databases can easily be constructed using spreadsheet software (e.g., Excel), a calculator with a spreadsheet function (Morgan, 1997), or as a pencil and paper exercise. Classroom-generated databases can easily be compared with trends for the nation, too (see NCHS, 2003). In addition, many of the education standards for data analysis and probability for grades 3-12 can be addressed through the assembly and use of databases (Table 1). While the sophistication of statistical analysis will vary drastically from lower grades to the college level, database construction and data summarization offer the opportunity to use these exercises throughout much of the formal educational training a student receives.

Regardless of grade level, several words of caution should be mentioned.
(1) When using a local information source, students may know people in the databases they are constructing. This may be an advantage if, for example, a student has a new baby sister listed in the birth announcement section of the newspaper and her information is included in a database. However, it may be devastating for a student whose uncle was just killed in a car accident, and is now listed in the obituary section, to include him in a database.
(2) Database construction in a classroom will not necessarily be easy. Missing data or measurement problems of some sort are likely to be encountered. Such situations can be exploited to teach students that data collection is often “messy” and that it is essential to be as careful and accurate as possible. Another pitfall is the tendency to view the database as a “random sample” when that is not likely to be the case. Instructors will want to be careful to define exactly what the database represents, which is more likely to be a well-defined population than a random sample.
This point has more relevance for secondary and university-level students covering statistical inference procedures because they require that samples be randomly selected.
(3) Database construction will be time-consuming for both students and instructor, especially when first beginning. We recommend starting with a small, easily controlled but meaningful data set. Complexity can be built into databases over time.

You may request the three databases we have developed by emailing the first author. When emailing, please include your name, institution/school, city, state/province, and country so that we may keep a record of requests. Databases used for this publication will be emailed to you as attached Excel files. Included in our databases are the compiled Baby (all infants), Twin babies, and Obituary raw data collected from the La Crosse Tribune. These databases may be freely used for educational purposes; however, it is suggested that they be used as examples. It is preferable to build your databases using student-collected data. The data were not double-checked for accuracy.

Acknowledgements: The authors thank L. Gerber and two anonymous reviewers for comments on the original manuscript.

References:
American Association for the Advancement of Science (AAAS). (1990). Science for All Americans. New York: Oxford University Press.
Capelle, J. & M. Smith. (1998). Using cemetery data to teach population biology & local history. The American Biology Teacher 60: 690-693.
Frequency of Inherited Disorders Database (FIDD) (1999). http://archive.uwcm.ac.uk/uwcm/mg/fidd/index.html
Guiry, M. D. & Nic Dhonncha, E. (2002). AlgaeBase. http://www.algaebase.org/default.html
Gundersen-Lutheran Hospital’s On-Line Nursery (http://www.gundluth.org/babies)
Kaiser, J. (2002). Population databases boom, from Iceland to the U.S. Science 298(5596): 1158-1161.
LaBare, K., R. Klotz, & E. Witherow. (2000). Using online databases to teach ecological concepts.
The American Biology Teacher 62(2): 124-127.
Morgan, L. (1997). Explorations: Statistics Handbook for the TI-83. Texas Instruments Inc.
National Center for Health Statistics website. 2003. (http://www.cdc.gov/nchs/)
National Council of Teachers of Mathematics (NCTM). (2000). Principles and Standards for School Mathematics. Reston, VA: The National Council of Teachers of Mathematics, Inc.
National Research Council (NRC). (1996). National Science Education Standards. Washington D. C.: National Academy Press.
-----. (2000). Inquiry and the National Science Education Standards: A Guide for Teaching and Learning. Washington D. C.: National Academy Press.
National Oceanographic Data Center (NODC) (2001) http://www.nodc.noaa.gov/
Putterbaugh, M. & J. Burleigh. (2001). Investigating evolutionary questions using online molecular databases. The American Biology Teacher 6: 422-431.
Shelly Cashman Series. (2000). Microsoft Office 2000: Introductory concepts and techniques. Cambridge, MA: Course Technology.
Spooner, B. & J. Barracato. (1999). Database Basics Skills Book. Arlington, VA: National Science Teachers Association.
SPSS for Windows, Rel. 11.5.1. 2002. Chicago: SPSS Inc.
Wilcox, A.J. (2001). On the importance – and the unimportance – of birthweight. International Journal of Epidemiology 30: 1233-1241. Online at: http://eb.niehs.nih.gov/bwt/V0M3QDQU.pdf
-----, R. Skjaerven, P. Buekens, & J. Kiely. (1995). Birth weight and perinatal mortality: A comparison of the United States and Norway. Journal of the American Medical Association 273: 709-711.

Table 1. Selected science* (NRC, 1996) and math+ (NCTM, 2000) standards relevant to this activity for K-12 grade levels.

Grade   Standard
3-5     Collect data using observations, surveys and experiments (p. 176)+
5-8     …tools and techniques to gather, analyze, and interpret data (p. 145)*
5-8     Nature of science (p. 170)*
6-8     Find, use, and interpret measures of center and spread, including mean and interquartile range (p. 248)+
6-8     Discuss and understand the correspondence between data sets and their graphical representations, especially histograms, stem-and-leaf plots, box plots, and scatterplots (p. 248)+
6-8     Use observations about differences between two or more samples to make conjectures about the populations from which the samples were taken (p. 248)+
9-12    Use technology and mathematics to improve investigations and communications (p. 175)*
9-12    Understandings about scientific inquiry (p. 176)*
9-12    Understand the meaning of measurement data and categorical data, of univariate and bivariate data, and of the term variable (p. 324)+
9-12    Understand how sample statistics reflect the values of population parameters and use sampling distributions as the basis for informal inference (p. 324)+

Captions for Figures
Figure 1. Example of the spreadsheet for the database of newborns (Baby (all infants) database) in La Crosse, WI.
Figure 2. Boxplots of birth weight for newborns in La Crosse, WI. Circles represent outliers in the data.
Figure 3. Scatterplot of birth weight vs. length for newborns in La Crosse, WI.
Figure 4. Mean birth weight in ounces of male and female newborn twins and singles with 95% confidence intervals for newborns in La Crosse, WI.
Figure 5. Histograms for ages of males and females using the Obituary database.
Figure 6. Example of an Excel spreadsheet with both raw (sex, year of death (YOD), year of birth (YOB)) and calculated (age = YOD - YOB) obituary data and an embedded table of statistics.

[Figure 2: boxplots titled "Newborns in La Crosse, WI"; y-axis: Weight (oz.); x-axis: SEX (Female, N = 56; Male, N = 69)]
[Figure 3: scatterplot titled "Newborns in La Crosse, WI"; y-axis: Weight (oz.); x-axis: Length (in.); legend: SEX (Male, Female)]
[Figure 4: x-axis: Sex (Female, Male); legend: TYPE (Single, Twin); N = 56, 28 (Female); 69, 22 (Male)]
[Figure 5: histograms in two panels, SEX = Female and SEX = Male; x-axis: AGE (Years); y-axis: Frequency]

Verification
This is to verify that our manuscript is neither being nor has been accepted for publication elsewhere.
D. Timothy Gerber ________________________________________
David M. Reineke ________________________________________