Data Sets for Use in Statistic, Measurement and Design Courses

Charles Stegman, Calli Holaway-Johnson, Sean Mulvenon, Sarah McKenzie, Ronna Turner, and Karen Morton
University of Arkansas

Paper presented at the Joint Statistical Meeting of the American Statistical Association, International Biometric Society, Institute of Mathematical Statistics, and Statistical Society of Canada
Seattle, Washington
August 2006

Abstract

A major focus in teaching graduate level courses in statistics, measurement, and design should be the analysis of data. Results can be used to illustrate key concepts underlying the procedures discussed, help students learn how to analyze theoretical data in preparation for their careers, aid in interpreting and presenting research results, and contribute to preparing future researchers. This paper presents information on a multitude of data sets applicable for teaching courses at multiple levels, and the accompanying CD contains the actual datasets.

Background

It is common for textbooks in statistics and research methodology to include a disk with several datasets that are used throughout the text. Glass and Hopkins (1996) is a good example, although others could be mentioned. Textbook datasets are commonly limited in terms of the number of datasets included and the number of cases within each dataset. The CD produced for this paper contains over 100 datasets from multiple fields, as well as Monte Carlo computer generated datasets. In addition, the datasets can be used across a range of courses from the introduction to research methodology and statistics through regression, ANOVA, multivariate, and advanced measurement.

Development of the CD

A first step was to locate publicly accessible datasets available on the web. These are datasets that can be downloaded and used in teaching so long as appropriate acknowledgement is given.
For example, many researchers and professors have made their datasets available for public use through the StatLib library at Carnegie Mellon University [http://lib.stat.cmu]. Four other helpful sites are the National Institute of Standards & Technology website [www.itl.nist.gov/div898/strd/general/dataarchive.html], the UCLA Statistics Lab website [www.ats.ucla.edu/stat], the Journal of Statistics Education Data Archive [www.amstat.org/publications/jse/jse_data_archive.html], and the DataFerrett [www.thedataweb.org]. The NIST site contains datasets that can be used to test or demonstrate the accuracy and precision of different computer packages when analyzing statistical data. The UCLA site contains a wealth of statistical information and sample programs. The JSE Data Archive contains datasets that have been submitted by researchers around the world, and includes articles utilizing the datasets if available. The DataFerrett allows you to search multiple topics through data mining technology and select variables for different analyses.

For the CD, selected datasets have been collected from these sites, with each dataset reviewed and included because it relates to topics regularly used as examples in statistics and research methodology courses. The datasets represent data from many fields of study, as do the examples in many of the textbooks. While professors and students can access any of these public domain datasets, the advantage of collecting them on a CD is that they are put into a standard format (Excel) and made readily available for uploading into numerous statistical packages. This should facilitate their use by multiple users in a variety of courses. Each dataset includes variable descriptions as well as the bibliographic information from the original source.

Additionally, samples from large scale datasets based on government sponsored research have been generated to support substantive educational research examples.
For example, census data and other government sponsored large scale research have produced datasets such as the Early Childhood Longitudinal Study (ECLS-K), the National Longitudinal Study of Youth (NLSY), the National Household Education Survey (NHES), and the National Education Longitudinal Study (NELS). DataFerrett can also be used to access large scale databases. The following are some of the topics that are available from DataFerrett: Health Care, Child School Enrollment, Computer Ownership & Uses, Voting & Registration, Race & Ethnicity, School Enrollment, Teenage Attitudes & Practices, and Library Use. Note that the DataFerrett allows you to search these and many more topics and select the variable sets you want.

A third area where datasets have been generated is through Monte Carlo procedures. By specifying population parameters, we generated datasets that reflect educational settings and illustrate important statistical properties. Multivariate data are also generated that can be used in a number of ways. For instance, variables can be selected for analysis in introductory courses and then revisited in more advanced courses like regression, design and multivariate statistics.

The Structure of the CD

Table 1 contains a list of the datasets contained on the CD. The title of each dataset is provided, as well as its name on the CD. The sample size and variables are also included. Finally, the original source for the data is given.

Insert Table 1

The datasets have been reformatted into Excel files. Many of the original files were in different formats and, while statisticians are adept at handling these, many students may still be learning basic data management. Especially in introductory classes, the emphasis is on data analyses using programs like SAS, SPSS, or R. Having the Excel files allows instructors the opportunity to write one set of instructions for importing data, allowing more time to concentrate on statistical analyses.
The exception is the large scale datasets from the national databases, which would be applicable to more advanced classes. Given the size of the datasets and the need for the weighting factors, Excel was too limiting. In this case, dBase and SAS data files were created. In more advanced classes, students could be expected to find, import, and clean data from the original sources. They could then analyze the data twice, once from each version, to make sure they get the same answers.

Example of Using Some of the Datasets

The dataset (Arkansas Math.xls) is based on simulated student data for grades 3-5 on the Arkansas Benchmark Mathematics Examination. The Arkansas Benchmark is a criterion-referenced examination that consists of both multiple-choice and open-response questions. Tests for each grade level are developed to reflect content identified in the Arkansas state frameworks. The multiple-choice and open-response sections are weighted equally in determining a student's score. In addition to their reported scaled scores, students are categorized as Below Basic, Basic, Proficient, or Advanced. Students with scaled scores of 200 or above are considered to be proficient, and those above 250 are considered to be advanced. The dataset contains 216 observations on 19 variables that would be available to school personnel. The observations were generated to reflect the actual variables used by the State of Arkansas for No Child Left Behind (NCLB) school assessments.

Some of the ways we have used the Arkansas Math dataset include the following: the scaled scores can be used to demonstrate graphs (frequency distribution, frequency polygon, box plot and stem and leaf), measures of central tendency, variability, skewness, kurtosis and normality. Similarly, we have used the grade, gender and teacher variables to create subgroups for the same type of analyses. Several of the categorical variables are analyzed as well (demographics, crosstabs, and percentages).
This is the material in the first five or six chapters in the introductory course. Students are required to create tables and figures using APA formats to help them in writing reports or articles.

The Arkansas Math dataset is also used to demonstrate a multitude of different statistical inferential procedures. You can select data for t-tests, ANOVA (one-way and factorials), model assumptions, multiple comparisons, effect sizes, correlation, regression, and chi-square analyses. The multiple-choice and open-response scores as well as the strand scores reflect multivariate data.

Another generated education dataset is Literacy Test.xls. This dataset was created to reflect data that would be available on many state criterion-referenced tests that are given at different grade levels. It differs from the previous example in a couple of important ways. First, it is a larger dataset (5000 observations) and second, it includes individual student item scores tied to three strands that might be typical on a Literacy examination. The strands in this example are content, literacy, and practical. Each strand has 8 multiple-choice items (worth 2 points each) and an open-response item worth 16 points. Students receive a scaled score based on the points earned on the literacy items plus their response to a writing prompt. Other variables include gender, race, and free and reduced lunch participation. The same type of analyses mentioned above can be demonstrated with the dataset, but by having item data, a number of advanced measurement issues can also be discussed.

A third example involves the two datasets based on the binomial distribution (Random Guessing.xls, 80% Mastery.xls). These datasets involve expected performance of 50 students on examinations worth 40 points.
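
Two binomial score sets of this kind can be imitated in a few lines. The sketch below is only an illustration in Python, not the procedure used to build the files on the CD. It assumes 50 students, 40 one-point items, and success probabilities of 0.2 and 0.8 (the values the paper associates with the guessing and mastery sets); the function names and the random seed are hypothetical.

```python
import math
import random


def binomial_scores(n_students, n_items, p, rng):
    """Simulate total scores: each item is answered correctly with probability p."""
    return [sum(1 for _ in range(n_items) if rng.random() < p)
            for _ in range(n_students)]


def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def binomial_skewness(n_items, p):
    """Theoretical skewness of a Binomial(n_items, p) distribution."""
    return (1 - 2 * p) / math.sqrt(n_items * p * (1 - p))


rng = random.Random(2006)  # hypothetical seed, fixed so the run is repeatable
guessing = binomial_scores(50, 40, 0.2, rng)
mastery = binomial_scores(50, 40, 0.8, rng)
merged = guessing + mastery  # pooling the two sets gives two separated clusters

print("theoretical skewness:", binomial_skewness(40, 0.2), binomial_skewness(40, 0.8))
print("sample skewness:", sample_skewness(guessing), sample_skewness(mastery))
print("merged score range:", min(merged), max(merged))
```

The theoretical skewness is positive for p = 0.2 and equal in size but negative for p = 0.8. With only 50 students per set, the sample skewness will vary from run to run, which is itself a useful point for class discussion.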
The first set assumes guessing and the second set involves "mastery learning." Note that instructors could actually conduct a class exercise and create the first dataset by giving students answer sheets to fill out without giving them the questions. The instructor could have the students "score" their tests with a pre-assigned answer key. The instructor could also discuss why some national tests involve a correction for guessing. Simple SAS "proc univariate" analyses show the first distribution is positively skewed (probability of a correct response p = 0.2), while the second is negatively skewed (p = 0.8). Students could then practice merging the datasets and demonstrate a bimodal distribution.

A fourth example (Star.xls) is based on student data (sample size is 150) for the STAR Reading and STAR Math tests given during the first quarter of the school year and the SAT-9 (reading, literacy, and math) given in the spring. Student gender is also included so that there are six variables for each student. Instructors can use the data for descriptive statistical purposes as well as correlation and regression analyses (including the correlation matrix, multiple regression, and testing for bivariate normality). Note that an instructor could also run simple procedures on the total data set, conduct separate analyses for each gender, and test for equality of correlations and parallelism of regression lines, as well as demonstrate ANCOVA and MANOVA. One use of such data might be to identify "at-risk" students and discuss potential interventions that might be used between October and May.

The Diamond Pricing datasets provide an example of how different analyses may require reformatting of the datasets. With the Diamond Pricing.xls dataset, students may conduct univariate analyses. With the Diamond Pricing With Dummy Variables.xls dataset, students can perform more complicated analyses such as multiple regression.
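
The reformatting that separates the two Diamond Pricing files amounts to turning each categorical variable into a set of 0/1 indicators. The following is a minimal Python sketch of that idea, not the authors' procedure. The indicator levels for color (D through H) and certification body (GIA, IGI) follow the variable list in Table 1; the three records, the unlisted levels "I" and "HRD", and the function name are hypothetical.

```python
def add_indicators(rows, column, levels):
    """Return copies of rows with one 0/1 indicator per level in `levels`.

    A category that is not listed in `levels` scores 0 on every indicator
    and therefore acts as the reference category in a regression.
    """
    coded = []
    for row in rows:
        new_row = dict(row)
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        coded.append(new_row)
    return coded


# Hypothetical records; these are not rows from Diamond Pricing.xls.
diamonds = [
    {"carat": 0.30, "color": "D", "certification": "GIA"},
    {"carat": 0.52, "color": "G", "certification": "IGI"},
    {"carat": 1.01, "color": "I", "certification": "HRD"},
]

coded = add_indicators(diamonds, "color", ["D", "E", "F", "G", "H"])
coded = add_indicators(coded, "certification", ["GIA", "IGI"])
for row in coded:
    print(row)
```

Leaving one level of each variable without an indicator avoids perfect collinearity with the intercept when the coded file is used for multiple regression.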
One valuable exercise might be to have students begin with the basic dataset and create the Diamond Pricing With Dummy Variables.xls file themselves by using a statistical package such as SAS, SPSS or R.

Certain datasets allow instructors to demonstrate various statistical concepts. For example, the Birth To Ten datasets are actual data that illustrate Simpson's paradox. The Baby Boom.xls dataset allows us to examine a variety of distributions, including binomial, Poisson, and exponential. These types of datasets can assist students in transitioning from a theoretical understanding to pragmatic application.

In addition to their use in parametric statistical analyses, many of the datasets lend themselves to nonparametric analyses. A valuable exercise might be to have students analyze a dataset using both parametric and nonparametric procedures. The resulting discussion could focus on the importance of choosing the appropriate statistical analysis, as well as the impact of violations of normality assumptions.

Large Scale Datasets

For large scale data analyses we have included the ECLS-K dataset. The Early Childhood Longitudinal Study – Kindergarten (ECLSK_sample) dataset is a subset of data from the ECLS-Kindergarten Class of 1998-99 (ECLS-K) Public Use Dataset (http://nces.ed.gov/ecls/) collected by the National Center for Education Statistics (West, http://nces.ed.gov/ecls/pdf/ksum.pdf). The complete dataset is available for public use, and is located at the NCES website along with more detailed User's Guide information, statistical documentation, and user resources. The complete dataset includes data on a nationally representative sample of about 21,260 children enrolled in both private and public full-day and partial-day kindergarten programs in the academic year 1998-99. The type of data includes child and parent demographic, child academic and behavioral, family environment, and classroom and school demographic variables.

The data file included on this disk is a subset of 97 academic, behavioral, demographic, and family environment variables (with 6 sample weighting variables and their associated 540 replicate weights) for a total of 643 variables. All 21,260 students are included in the dataset; thus the ECLSK_sample dataset has the same sampling properties as the original public use dataset. In the original sampling, oversampling occurred for select subgroups such as Asian students and students in private kindergarten programs (West, http://nces.ed.gov/ecls/pdf/ksum.pdf). Thus, weighting variables are necessary for producing data that are representative of the 1998-99 national population. Additionally, the multi-stage sampling procedure used probability sampling from within primary sampling units. Because the sampling procedure allows for correlated samples, the within-group error variance is an underestimate of what would be found in the population, and subsequently, test statistics computed from the samples will be inflated. There are two common ways to adjust test statistics computed from the samples: the use of design effects or the use of re-estimation statistical packages such as SUDAAN (http://www.rti.org/sudaan/) or WesVar (http://www.westat.com/wesvar/). Design effect estimates can be found in the ECLS-K User's Guide.

The ECLSK_sample data file is recommended for use by students in moderate to advanced applied research methods and statistics courses; it is not recommended for students in introductory courses. The format of the variables requires students to utilize recoding procedures and provides opportunities for students to practice the creation of new variables by combining multiple related background and/or environmental variables. Weighting can be introduced to the students through the use of the sampling weights provided in the data file.
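
The two ideas above, weighting cases and correcting a standard error with a design effect, can be shown on a toy example. The Python sketch below is only an illustration under stated assumptions: the four scores, the weights, the standard error of 0.5, and the design effect of 4.0 are all hypothetical and are not taken from ECLS-K, and the function names are invented for this example. Actual design effect estimates should come from the ECLS-K User's Guide.

```python
import math


def weighted_mean(values, weights):
    """Mean in which each case counts in proportion to its sampling weight."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def design_adjusted_se(srs_se, deff):
    """Inflate a simple-random-sampling standard error by sqrt(DEFF)."""
    return srs_se * math.sqrt(deff)


def effective_sample_size(n, deff):
    """Sample size an equally precise simple random sample would need."""
    return n / deff


# Hypothetical scores and sampling weights (not ECLS-K values).
scores = [22.0, 30.0, 27.0, 35.0]
weights = [150.0, 400.0, 250.0, 200.0]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print("unweighted mean:", unweighted)
print("weighted mean:", weighted)

# Hypothetical design effect of 4.0 applied to a hypothetical SE of 0.5.
print("adjusted SE:", design_adjusted_se(0.5, 4.0))
print("effective n:", effective_sample_size(21260, 4.0))
```

Because the standard error grows by the square root of the design effect, a t statistic computed under simple random sampling assumptions shrinks by the same factor once the adjustment is made.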
Additionally, students can learn about the need for design effects with samples obtained by clustered or multi-stage sampling procedures and/or the use of jackknifing procedures with selection of the replicate weights provided. The types of variables allow for a variety of statistical procedures including nonparametric statistics, multiple regression, analysis of variance, analysis of covariance, and multivariate analysis of variance procedures.

Professors teaching courses that include multiple regression, multivariate analysis, measurement and evaluation, and large-scale database analysis may find the data file useful for classroom examples and student practice. Additionally, professors will be able to create numerous smaller datasets from the data file for classroom use. Included in the ECLS-K folder are the data file in two formats (a dBase file and a SAS data file; an Excel file could not be used because of the 256-variable limit), a Microsoft Word file of the variable codebook, and a SAS file listing the variable labels and format statements. The user will want to review the ECLS-K User's Guide for more detailed information on sampling, data collection, variables, use of weights, design effects, and appropriate variance estimation procedures. The dBase (.dbf) file is recommended for use in WesVar.

Monte Carlo Simulations

If you have descriptive statistical information for a data set, but don't actually have the data set, a very efficient method to help develop a practice or pilot research data set is through the use of Monte Carlo simulations. In Monte Carlo simulations a researcher uses the descriptive data to create "parallel" data sets that have the characteristics of the original data set. Further, the researcher can create an unlimited number of cases and conditions associated with this original data set.
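
One common way to build such a parallel data set from means, standard deviations, and correlations is to factor the correlation matrix and transform independent normal draws. The Python sketch below illustrates that approach with a Cholesky factorization. It is not the SAS macro the paper refers to, and it assumes multivariate normality. The means, standard deviations, and correlations are the values printed for V1, V2, and V4 in Table 2; the function names and the seed are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L * L^T == matrix (must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulate(means, sds, corr, n_cases, rng):
    """Draw n_cases rows from a multivariate normal with the given moments."""
    lower = cholesky(corr)
    rows = []
    for _ in range(n_cases):
        z = [rng.gauss(0.0, 1.0) for _ in means]
        # Correlate the independent normals, then rescale to each mean and SD.
        correlated = [sum(lower[i][k] * z[k] for k in range(i + 1))
                      for i in range(len(means))]
        rows.append([m + s * c for m, s, c in zip(means, sds, correlated)])
    return rows


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Values printed for V1, V2, and V4 in Table 2.
means = [669.4, 680.2, 668.6]
sds = [39.1, 48.8, 37.9]
corr = [[1.00, 0.91, 0.71],
        [0.91, 1.00, 0.65],
        [0.71, 0.65, 1.00]]

rng = random.Random(10)  # hypothetical seed, fixed so the run is repeatable
data = simulate(means, sds, corr, 10000, rng)
sample = rng.sample(data, 200)  # a random subset of 200 simulated cases

v1 = [row[0] for row in data]
v2 = [row[1] for row in data]
print("simulated mean of V1:", sum(v1) / len(v1))
print("simulated r(V1, V2):", pearson(v1, v2))
```

Only three of the six variables are used here because the factorization requires a positive definite correlation matrix, and correlations rounded to two decimals for a subtest and the total that contains it can sit very close to that boundary.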
The use of Monte Carlo simulations has traditionally been used in statistics and other related fields to evaluate the effectiveness of new methods and procedures. For example, a researcher develops a new statistical procedure, however this procedure needs to be checked under various conditions for discrepant sample size, normality and non-normality conditions. Collecting data or using archival data sets to evaluate the effectiveness of this new procedure under these various conditions would take a protracted amount of time. Further, issues of random sampling error for the archival data sets may also be problem. Thus, the researcher would use the collected and archival data sets and Monte Carlo simulations. A Monte Carlo simulation using the Stanford Achievement Test, Version 10 (SAT-10) data is demonstrated. Descriptive information for the SAT-10 7th grade spring administration of the exam has been selected. Descriptive information needed to conduct this type of Monte Carlo simulation are the means, standard deviations, and the correlations among all the variables (See Table 2). The variables selected for this simulation are Reading Vocabulary, Reading Comprehension, Reading Total, Math Concepts, Math Problem Solving, and Math Total. Table 2. Descriptive Statistics for SAT-10 7th Grade Spring Exam _____________________________________________________________________________ Correlations Variable Mean Std V1 V2 V3 V4 V5 V6 _____________________________________________________________________________ Reading: Vocabulary (V1) 669.4 39.1 1.00 . . . . . Comprehension (V2) 680.2 48.8 0.91 1.00 . . . . Total (V3) 663.3 39.1 0.96 0.78 1.00 . . . Math: Concepts (V4) 668.6 37.9 0.71 0.65 0.68 1.00 . . Problem Solving (V5) 666.2 37.6 0.69 0.64 0.66 0.95 1.00 . 
Total (V6) 672.2 48.1 0.64 0.57 0.62 0.93 0.77 1.00 _____________________________________________________________________________ Using the following sample program written in SAS version 9.2 (See Figure 1) you can complete a Monte Carlo simulation of the SAT-10 Grade 7th data provided in Table 2. A data set called SAT 10 Macro.xls with 10,000 observations, generated from using the macro in Figure 1 is available on the provided CD. This type of simulation process can also be extremely valuable for use in classroom environments. The last few lines of SAS code include a procedure called “Proc Surveyselect.” This procedure can be used to select random subsets of the data from the file SAT 10 Macro.xls. For this example, we have selected a sample of 200, with the data output to a file called “temp1.” This file, listed on the CD as Temp 1.xls, contains the 200 observations, randomly selected from SAT 10 Macro.xls. To confirm the macro is working effectively, the descriptive statistics for "temp1" are provided in Table 3. A comparison of the descriptive statistics from Table 2 with 7 Table 3 provides the necessary evidence to confirm that “temp1" is a representative sample of the SAT-10 7th Grade achievement data. Using Monte Carlo simulation procedures you can develop individualized data sets for students, complete pilot research work, or examine results for previous studies under the different conditions you place on the analyses. Table 3. Descriptive Statistics for Monte Carlo Sample of 200 for SAT-10 7th Grade Fall Exam _____________________________________________________________________________ Correlations Variable Mean Std V1 V2 V3 V4 V5 V6 _____________________________________________________________________________ Reading: Total (V1) 668.5 39.4 1.00 . . . . . Vocabulary (V2) 680.6 49.3 0.91 1.00 . . . . Comprehension (V3) 663.3 39.4 0.96 0.78 1.00 . . . Math: Total (V4) 668.3 37.9 0.72 0.65 0.69 1.00 . . Concepts (V5) 666.0 37.6 0.70 0.64 0.66 0.95 1.00 . 
Problem Solving (V6) 671.8 48.2 0.64 0.57 0.62 0.93 0.77 1.00 _____________________________________________________________________________ Sample printout from SAS Examples of some of the SAS printout for selected analyses are included in Appendix A. They include a univariate analysis, SAS graph, correlation, and an ANOVA. These demonstrate how a standard statistical program will generate examples for discussion in class. Conclusion and Distribution The paper discussed the contents and structure of the CD datasets as well as suggestions for how some of the datasets can be utilized. The CD is free and you may use it in your teaching. Again, proper credit must be given to the appropriate source. For instance, at StatLib they use the statement: “If you use an algorithm, dataset, or other information from StatLib, please acknowledge both StatLib and the original contributor of the material.” For the NCES datasets they prefer the following citation: National Center for Education Statistics, U.S. Department of Education. We hope these datasets will be helpful as you prepare your courses. We will continue to add additional datasets to the CD and will make them available to interested professionals. You may contact one of the authors at the University of Arkansas. 8 Table 1. Data Sets for Use in Statistic, Measurement, and Design Courses Title of Data Set 1993 New Car Data 1994 AAUP Faculty Salary Data 2004 New Car and Truck Data Name on CD 1993 Cars AAUP 2004 Cars n 93 1161 428 Variables in Data Set Manufacturer, Model, Type, Minimum price, Midrange price, Maximum price, City MPG, Highway MPG, Air bags standard, Drive train type, Number of cylinders, Engine size, Horsepower, RPM, Engine revolutions per mile, Manual transmission available, Fuel tank capacity, Passenger capacity, Length, Wheelbase, Width, U-turn space, Rear seat room, Luggage capacity, Weight, Domestic manufacturing Federal ID number, College Name, State, Type, Avg. salary—full professors, Avg. 
salary—associate professors, Avg. salary— assistant professors, Avg. salary—all ranks, Avg. compensation—full professors, Avg. compensation—associate professors, Avg. compensation—assistant professors, Avg. compensation—all ranks, Number of full professors, Number of associate professors, Number of assistant professors, Number of Instructors, Number of faculty—all ranks Vehicle name, Sports car, SUV, Wagon, Minivan, Pickup, All-wheel drive, Rearwheel drive, Suggested retail price, Dealer price, Engine size, Number of cylinders, Horsepower, City MPG, Highway MPG, Weight, Wheel base, Length, Width Source Consumer Reports: The 1993 CarsAnnual Auto Issue (April), Yonkers: Consumers Union. PACE New Car & Truck 1993 Buying Guide. Milwaukee: Pace Publications. Quoted in Lock, R. H. (1993). 1993 New Car Data. Journal of Statistics Education, 1(1). March-April 1994 issue of Academe. Submitted to the Journal of Statistics Education by Robin Lock. Kiplinger's Personal Finance, December 2003, vol. 57, no. 12, pp. 104-123, http:/www.kiplinger.com. Submitted to the Journal of Statistics Education by Roger W. 
Johnson Title of Data Set Name on CD n Variables in Data Set A Dataset That Is 44% Outliers Outlier 43 President name, Number of days in office Abortion Opinion Data Abortion Opinion 2385 Race, Gender, Age, Opinion Absentee and Machine Ballot Votes in Philadelphia Elections Advertising Pages and Advertising Revenue in 1986 Annual Data on Advertising, Promotions, Sales Expenses, and Sales Annual Return Rates in the Stock Market, 1976-1993 Attitude Survey Data Philadelphia Voting 22 Advertising Pages 41 Advertising Stock Market Employee Satisfaction 22 Year of election, District number, Democrat absentee vote in district, Republican absentee vote in district, Democrat machine vote in district, Republican machine vote in district Name of publication, Number of advertising pages in hundreds, Advertising revenue in millions of dollars Advertising expenditures, Promotion expenditures, Sales expense, Sales, Previous year's advertising expenditures, Previous year's promotion expenditures Source 2001 World Almanac. Quoted in Hayden, R. W. (2005). A dataset that is 44% outliers. Journal of Statistics Education, 13(1). Christensen, R. (1990). Log-linear models. New York: Springer-Verlag. Orley Ashenfelter. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. 18 Year, Standard and Poor’s Index year end value, Vanguard Index Trust 500 Portfolio year end value Vanguard Market Index Trust 500-Portfolio Annual Report, 1993 (p. 7). Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley. 
30 Overall rating of job being done by supervisor, Handles employee complaints, Does not allow special privileges, Opportunity to learn new things, Raises based on performances, Too critical of poor performances; Rate of advancing to better jobs Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. 10 Title of Data Set Average Monthly Air Temperature in Recife, Brazil, 19531962 Ball Bearing Reliability Data Name on CD Average Temperature Ball Bearings n Variables in Data Set Source 120 Month, Year, Average air temperature for a given month http://www.bath.ac.uk/~mascc/ Recife.TS 210 Company code, Test number, Year of test, Number of bearings, Load, Number of balls, Diameter, L10, L50, Weibull slope, Bearing type Lieblein and Zelen (1956). Statistical investigation of the fatigue life of deepgroove ball bearings. Quoted in Caroni (2002). Modeling the reliability of ball bearings. Journal of Statistics Education, 10(3). Baseline Data for Mayo Clinic Trial in Primary Biliary Cirrhosis (PBC) of the Liver Baseline Cirrhosis 418 Betting on Professional Football Results for 19891991 NFL 672 ID; Number of days between registration and the earlier of death, transplantion, or study analysis time in July, 1986; Death status; Drugs administered; Age; Sex; Fleming, T. R., & Harrington, D. P. Presence of ascites; Presence of (1991). Counting processes and hepatomegaly; Presence of spiders; survival analysis. New York: Wiley. Presence of edema; Serum bilirubin; Serum cholesterol; Albumin; Urine copper; Alkaline phosphatase; SGOT; Triglycerides; Platelets; Prothrombin time; Histologic stage of disease Compiled by Hal Stern. Submitted to the Statlib facility by Robin Lock. Name of favored team, Name of underdog Quoted in Chatterjee, S., Handcock, M. team, Betting result, Day and time of game, S., & Simonoff, J. S. (1995). A Favored team at home or away, Week of casebook for a first course in statistics season, Year and data analysis. 
New York: John Wiley. 11 Title of Data Set Name on CD n Variables in Data Set Birth to Ten Study: An Example of Simpson's Paradox Birth to Ten A (Note: This data set contains the same information as Birth to Ten B in a different format.) 1590 Medical aid given to mother, Mother traced for 5 year interview, Race, Frequency Birth to Ten Study: An Example of Simpson's Paradox Birth to Ten B (Note: This data set contains the same information as Birth to Ten A in a different format.) 1590 Medical aid given to mother, Mother traced for 5 year interview, Race 24 Taxes, Number of bathrooms, Lot size, Living space, Number of garage stalls, Number of rooms, Number of bedrooms, Age of the home, Number of fireplaces, Sale price Building Characteristics and Sales Price Property Valuation Calcium, Inorganic Phosphorus and Alkaline Phosphatase Levels in Elderly Patients Calcium (Note: This dataset intentionally has errors so that students 178 may practice cleaning data. The cleaned dataset is Calciumgood.) Patient observation number, Age in years, sex; Alkaline phosphatase international units/liter, Lab name, Calcium mmol/L, Inorganic phosphorus mmol/L, Age group 12 Source Chronic Diseases of Lifestyle Programme at the Medical Research Council in Cape Town, South Africa. Quoted in Morrell, C. H. (1999). Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3). Chronic Diseases of Lifestyle Programme at the Medical Research Council in Cape Town, South Africa. Quoted in Morrell, C. H. (1999). Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3). Narula, S. C., & Wellington, J. F. (1977). Technometrics, 19 (2). Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley. Boyd, J., Delost, M., and Holcomb, J. (1998). Calcium, phosphorus, and alkaline phosphatase laboratory values of elderly subjects. 
Clinical Laboratory Science, 11. Quoted in Holcomb, J., and Spalsbury, A. (2005). Journal of Statistics Education, 13(3).

Title of Data Set: Calcium, Inorganic Phosphorus and Alkaline Phosphatase Levels in Elderly Patients--Cleaned Dataset
Name on CD: Calciumgood
n: 178
Variables in Data Set: Patient observation number, Age in years, Sex, Alkaline phosphatase international units/liter, Lab name, Calcium mmol/L, Inorganic phosphorus mmol/L, Age group
Source: Boyd, J., Delost, M., and Holcomb, J. (1998). Calcium, phosphorus, and alkaline phosphatase laboratory values of elderly subjects. Clinical Laboratory Science, 11. Quoted in Holcomb, J., and Spalsbury, A. (2005). Journal of Statistics Education, 13(3).

Title of Data Set: Cigarette Consumption Data by State, 1970
Name on CD: Cigarette Consumption
n: 51
Variables in Data Set: State; Median age; Percentage of people over 25 years of age who had completed high school; Per capita personal income; Percentage of blacks; Percentage of females; Weighted average price of a pack of cigarettes; Number of packs of cigarettes sold on a per capita basis
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Cloud Seeding Data
Name on CD: Cloud Seeding
n: 24
Variables in Data Set: Action, Day number, Seeding suitability, Echo coverage, Prewetness, Echo motion, Amount of rain
Source: Woodley, W. L., Simpson, J., Biondini, R., & Berkeley, J. (1977). Rainfall results 1970-75: Florida area cumulus experiment. Science, 195, 735-42. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Cloud-seeding Experiment in Tasmania Between Mid-1964 and January 1971
Name on CD: Rainfall
n: 108
Variables in Data Set: Period, Seeding status, Season, East target area rainfall, West target area rainfall, North control area rainfall, South control area rainfall, Northwest control area rainfall
Source: Miller, A. J., Shaw, D. E., Veitch, L. G. & Smith, E. J. (1979). Analyzing the results of a cloud-seeding experiment in Tasmania. Communications in Statistics - Theory & Methods, vol. A8(10), 1017-1047.

Title of Data Set: Comparison of Changes in Exchange Rates and Differences in Inflation Rates for Various Countries
Name on CD: Exchange Rates
n: 44
Variables in Data Set: Country name, Change in exchange rate 1975-1990, Change in exchange rate 1985-1990, Change in inflation rates 1975-1990, Change in inflation rates 1985-1990
Source: International Financial Statistics Yearbook. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Comparison of Health Care Spending Across the United States
Name on CD: Health Care Spending
n: 50
Variables in Data Set: State, Census Bureau region of the state, Census Bureau region number, Per capita health spending, Percent of per capita income spent on health
Source: The New York Times. October 15, 1993. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Comparison of Productivity and Quality in Japanese and Non-Japanese Automobile Manufacturing
Name on CD: Japanese Autos
n: 27
Variables in Data Set: Assembly defects per 100 cars, Hours per vehicle, National origin of facility, Assembly defects per 100 cars (non-Japanese origin), Assembly defects per 100 cars (Japanese origin), Hours per vehicle (non-Japanese origin), Hours per vehicle (Japanese origin)
Source: Womack, J. P., Jones, D. T., & Roos, D. (1990). The machine that changed the world. New York: Rawson. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Consumer Expenditure and Money Stock 1952-1956
Name on CD: Consumer Expenditure
n: 20
Variables in Data Set: Quarter, Consumer expenditure, Money stock
Source: Friedman, M., & Meiselman, D. (1963). Commission on money and credit, stabilization policies. Englewood Cliffs, NJ: Prentice Hall. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: County Data from the 2000 Presidential Election in Florida (Excluding Federal Absentee Votes)
Name on CD: Florida Voting 2000
n: 67
Variables in Data Set: County, Type of voting machine used, Column format of ballot, Undervote count, Overvote count, Votes counted for Bush, Gore, Browne, Nader, Harris, Hagelin, Buchanan, McReynolds, Phillips, Moorehead, Chote, McCarthy
Source: http://www.stat.ufl.edu/~presnell/fl2000.txt

Title of Data Set: Data on French Economy; IMPORT Data (Billions of French Francs)
Name on CD: French Economy
n: 18
Variables in Data Set: Year, Imports, Domestic production, Stock formation, Domestic consumption
Source: Malinvaud, E. (1968). Statistical methods in econometrics. Chicago: Rand McNally. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Diameter, Height, and Volume of Black Cherry Trees in Allegheny National Forest, Pennsylvania
Name on CD: Cherry Trees
n: 31
Variables in Data Set: Diameter, Height, Volume
Source: Ryan, T., Joiner, B., & Ryan, B. (1976). Minitab student handbook. North Scituate, MA: Duxbury Press. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Diamond Pricing with Dummy Variables
Name on CD: Diamond Pricing with Dummy Variables
n: 308
Variables in Data Set: Carat, Indicator for color D, Indicator for color E, Indicator for color F, Indicator for color G, Indicator for color H, Indicator for clarity IF, Indicator for clarity VVS1, Indicator for clarity VVS2, Indicator for clarity VS1, Indicator for certification body GIA, Indicator for certification body IGI, Indicator for medium stones, Indicator for large stones, Interaction variable med*carat, Interaction variable large*carat, Carat squared, Price in Singapore dollars, Ln(Price)
Source: Chu, S. (2001). Pricing the C's of diamond stones. Journal of Statistics Education, 9(2).

Title of Data Set: Disposable Income and Ski Sales for Years 1964-1974
Name on CD: Ski Sales 1
n: 40
Variables in Data Set: Quarter, Ski sales, Personal disposable income
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.
Title of Data Set: Disposable Income, Ski Sales, and Seasonal Variables for Years 1964-1974
Name on CD: Ski Sales 2
n: 40
Variables in Data Set: Quarter, Ski sales, Personal disposable income, Season
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Distribution for Males and Females Born in Sweden in 1935
Name on CD: Swedish Birth Dates
n: 12
Variables in Data Set: Month, Number of females born, Number of males born
Source: Cramer, H. (1946). Mathematical methods of statistics. Princeton: Princeton University Press. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Distribution of White Student Enrollment in Nassau County School Districts
Name on CD: White Enrollment
n: 56
Variables in Data Set: District, Proposed legislative district, Total public school enrollment, White student enrollment
Source: Newsday, May 20, 1994. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Dow Jones Industrial Average and the S & P 500 Index Values Weekly From February 1, 1991 to February 25, 1994
Name on CD: Dow Jones
n: 161
Variables in Data Set: Date, Dow Jones Industrial Average at the close of the day, Standard and Poor’s 500 Stock Index at the close of the day
Source: Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Drill Bit Performance Over a Range of Drilling Conditions
Name on CD: Drill Bit Data
n: 31
Variables in Data Set: Speed of rotation, Feed rate, Diameter of drill bit, Axial load on drill bit
Source: M. R. Delozier of Kennametal, Inc., Latrobe, Pennsylvania. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Drug Dosage Retained in Rat Livers
Name on CD: Rat Data
n: 19
Variables in Data Set: Body weight, Liver weight, Relative dose, Percentage of dose retained in liver
Source: Weisberg, S. (1980). Applied Linear Regression. New York: Wiley. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Early Childhood Longitudinal Study (ECLS-K) Data
Name on CD: ECLSK_sample.sas7bdat
n: 21260
Variables in Data Set: See ECLSK_sample codebook.doc (643 variables available)
Source: National Center for Education Statistics, U.S. Department of Education; accessed at http://nces.ed.gov/

Title of Data Set: Effectiveness of Blast Furnace Slags as Agricultural Liming Materials on Three Soil Types
Name on CD: Agricultural Data
n: 7
Variables in Data Set: Treatment, Soil type, Corn yield
Source: Carter, O. R., Collier, B. L., & Davis, F. L. (1951). Blast furnace slags as agricultural liming materials. Agronomy Journal, 43, 430-433. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Emergency Calls to the New York Auto Club in January 1993 and January 1994
Name on CD: Auto Calls
n: 28
Variables in Data Set: Date, Emergency road service calls answered, Forecast high temperature, Forecast low temperature, Daily high temperature, Daily low temperature, Rain forecast, Snow forecast, Type of day, Year, Sunday, Subzero temperature
Source: New York Motorist. (March 1994). Automobile Club of New York. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Equal Educational Opportunity (EEO) Data; Standardized Indexes
Name on CD: EEO Data
n: 70
Variables in Data Set: Family, Peer, School Achievement
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Eruption Durations and Intereruption Times for the "Old Faithful" Geyser in Yellowstone National Park
Name on CD: Old Faithful
n: 222
Variables in Data Set: Date, Duration of eruption, Time until next eruption
Source: Weisberg, S. (1985). Applied linear regression (2nd ed.). New York: John Wiley. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.
Title of Data Set: Excretion of Steroids in Patients with Cushing's Syndrome
Name on CD: Cushing’s Syndrome
n: 21
Variables in Data Set: Type of Cushing’s syndrome, Levels of tetrahydrocortisone, Levels of pregnanetriol
Source: Aitchison, J., & Dunsmore, I. R. (1975). Statistical prediction analysis. Cambridge: Cambridge University Press. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Financial Ratios of Solvent and Bankrupt Firms
Name on CD: Financial Ratios
n: 66
Variables in Data Set: (working capital)/(total assets), (retained earnings)/(total assets), (earnings before interest and taxes)/(total assets), (market value equity)/(book value of total liabilities), sales/(total assets), bankruptcy status
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Forced Expiratory Volume of Smokers and Non-smokers
Name on CD: FEV
n: 654
Variables in Data Set: Age, Forced Expiratory Volume (FEV), Height, Sex, Smoking status
Source: Rosner, B. (1999), Fundamentals of Biostatistics, 5th Ed., Pacific Grove, CA: Duxbury. Quoted in Kahn, M. (2005). An exhalent problem for teaching statistics. Journal of Statistics Education, 13(2).

Title of Data Set: Fuel Consumption and Automotive Variables
Name on CD: Fuel Consumption
n: 30
Variables in Data Set: Miles/gallon, Displacement, Horsepower, Torque, Compression ratio, Rear axle ratio, Carburetor (barrels), Number of transmission speeds, Overall length, Width, Weight, Type of transmission
Source: Motor Trend magazine, 1975. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Gesell Adaptive Score and Age at First Word
Name on CD: First Word
n: 21
Variables in Data Set: Age at first word, Gesell adaptive score
Source: Mickey, M. R., Dunn, O. J., & Clark, V. (1967). Note on the use of stepwise regression in detecting outliers. Computers & Biomedical Research, 1, 105-9. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.
Title of Data Set: Graduate Admissions at Berkeley
Name on CD: Berkeley Graduate Admissions
n: 4526
Variables in Data Set: Department, Gender, Admission status
Source: Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398-404. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Jet Fighter Data
Name on CD: Jet Fighter
n: 22
Variables in Data Set: Aircraft ID, First flight date, Specific power, Flight range factor, Payload, Sustained load factor, Carrier capability
Source: Stanley, W., & Miller, M. (1979). Measuring technological change in jet fighter aircraft. Report No. R-2249-AF. Santa Monica: Rand Corp. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Lead Rating and News Rating of Television Data
Name on CD: Television Ratings
n: 30
Variables in Data Set: Lead rating, News rating
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Length of Computer Service Calls and Number of Units Repaired
Name on CD: Service Calls 1
n: 14
Variables in Data Set: Units, Minutes
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Length of Computer Service Calls and Number of Units Repaired--Expanded Sample
Name on CD: Service Calls 2
n: 24
Variables in Data Set: Units, Minutes
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Length of Visits to msnbc.com on September 28, 1999
Name on CD: msnbclength
n: 50,000
Variables in Data Set: Length of visit
Source: Internet Information Server logs for msnbc.com and news-related portions of msn.com. Quoted in Sanchez, J. and He, Y. (2005). Internet data analysis for the undergraduate statistics curriculum. Journal of Statistics Education, 13(3).

Title of Data Set: Leukemia Data for Patients Diagnosed as AG Positive
Name on CD: Leukemia Data AG Positive
n: 17
Variables in Data Set: White blood cell count, Survival time
Source: Feigl, P., & Zelen, M. (1965). Estimation of exponential probabilities with concomitant information. Biometrics, 21, 826-838. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Leukemia Data for Patients Diagnosed as AG Positive or AG Negative
Name on CD: Leukemia Data
n: 30
Variables in Data Set: White blood cell count, AG status, Number of patients surviving at least 52 weeks, Number of patients in each combination of WBC and AG
Source: Feigl, P., & Zelen, M. (1965). Estimation of exponential probabilities with concomitant information. Biometrics, 21, 826-838. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Los Angeles Heart Study Data
Name on CD: Chapman Data
n: 200
Variables in Data Set: Age, Systolic blood pressure, Diastolic blood pressure, Cholesterol, Height, Weight, Coronary incident
Source: Dixon, W. J., & Massey, F. J., Jr. (1983). Introduction to statistical analysis. New York: McGraw-Hill. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Lug Counts from Vineyard Harvest by Row and Year of Harvest
Name on CD: Lug Counts
n: 52
Variables in Data Set: Row number, Number of lugs for 1983, Number of lugs for 1984, Number of lugs for 1985, Number of lugs for 1986, Number of lugs for 1987, Number of lugs for 1988, Number of lugs for 1989, Number of lugs for 1990, Number of lugs for 1991
Source: Barnhill family archives, 1976-1991. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.
Title of Data Set: Major League Baseball Hall of Fame
Name on CD: MLBHOF
n: 1340
Variables in Data Set: Player name, Number of seasons played, Games played, Official at-bats, Runs scored, Hits, Doubles, Triples, Home runs, Runs batted in, Walks, Strikeouts, Career batting average, On base percentage, Slugging percentage, Adjusted production, Batting runs, Adjusted batting runs, Runs created, Stolen bases, Times caught stealing, Stolen base runs, Fielding average, Fielding runs, Primary position played, Total player rating, Hall of Fame Status
Source: The Baseball Encyclopedia and Total Baseball. Quoted in Cochran, J. (2000). Career records for all modern position players eligible for the Major League Baseball Hall of Fame. Journal of Statistics Education, 8(2).

Title of Data Set: Mayo Clinic Trial in Primary Biliary Cirrhosis (PBC) of the Liver, 1974-1984
Name on CD: Cirrhosis
n: 312 (data given for 1945 visits)
Variables in Data Set: ID; Number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986; Death status; Drugs administered; Age; Sex; Number of days between enrollment and this visit date; Presence of ascites; Presence of hepatomegaly; Presence of spiders; Presence of edema; Serum bilirubin; Serum cholesterol; Albumin; Alkaline phosphatase; SGOT; Platelets; Prothrombin time; Histologic stage of disease
Source: Fleming, T. R., & Harrington, D. P. (1991). Counting processes and survival analysis. New York: Wiley.

Title of Data Set: Monte Carlo Simulation
Name on CD: Sample Monte Carlo Simulation Program.doc
n: 10,000
Variables in Data Set: 7th Grade SAT-10: Reading vocabulary, Reading comprehension, Reading total, Math concepts, Math problem solving, and Math total
Source:
Simulated data based on SAT-10 means, standard deviations, and correlations

Title of Data Set: Monthly Domestic Electricity Consumption at Different Temperatures
Name on CD: Electricity
n: 55
Variables in Data Set: Month of observation, Year of observation, Average daily usage, Average daily temperature
Source: Handcock family archives, August 1989-February 1994. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Monthly Sunspots Numbers from 1740 to 1983
Name on CD: Sunspots
n: 2820
Variables in Data Set: Year, Number of sunspots per month (January-December)
Source: http://www.bath.ac.uk/~mascc/sunspots.TS

Title of Data Set: Number of Deaths by Horsekicks in the Prussian Army from 1875-1894 for 14 Corps
Name on CD: Horsekick Deaths
n: 20
Variables in Data Set: Year, Corp1-Corp14, Total
Source: Andrews, D. F., & Herzberg, A. M. (1985). Data. Springer-Verlag: New York. Accessed at Statlib, http://lib.stat.cmu.edu/datasets/Andrews/

Title of Data Set: Number of Supervised Workers and Supervisors in 27 Industrial Establishments
Name on CD: Number of Supervised Workers
n: 27
Variables in Data Set: Number of supervised workers, Number of supervisors
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Number of Surviving Bacteria Following Exposure to 200-Kilovolt X-rays at 6-minute Intervals
Name on CD: Bacteria Death Rates
n: 15
Variables in Data Set: Interval, Number of bacteria
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Numbers of Reported Sexual Partners of a Sample of Males and Females
Name on CD: Sexual Partners
n: 3533
Variables in Data Set: Male, Female
Source: The general social survey, 1989-1991. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Occupations of Family Heads for Families of Various Religious Groups
Name on CD: Religion and Occupation
n: 3966
Variables in Data Set: Religious affiliation, Occupation, Number for each category
Source: Lazerwitz, B. (1961). A comparison of major United States religious groups. Journal of the American Statistical Association, 56, 568-579. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Perceptions of the New York City Subway System
Name on CD: New York Subway
n: 62
Variables in Data Set: Usage of subway, Cleanliness of stations, Cleanliness of trains, Safety in station, Safety on trains, Rush hour crowding in stations, Rush hour crowding on trains, In-station information, On-train announcements, Convenience of train stops, Convenience of train schedule, Speed of travel, Frequency of trains, Ease of token purchase, Ease of token collection, Police presence in stations, Police presence on trains, Availability of maps, Number of uses per week
Source: Survey conducted at the Leonard N. Stern School of Business, Spring 1994. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Performance of National Basketball Association Guards
Name on CD: NBA
n: 105
Variables in Data Set: Player’s name, Player’s height, Number of games appeared in, Total minutes played, Player’s age, Points scored per game, Assists per game, Rebounds per game, Percent of field goals made, Percent of free throws made
Source: Cohn, J. (1994). The pro basketball bible. San Diego: Basketball Books Ltd. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Presidential Election Data, 1916-1988
Name on CD: Election
n: 19
Variables in Data Set: Year, Democratic share of the two-party vote, Party of incumbent, Party of incumbent running for election, Growth rate of real per capita GNP in the second and third quarters of the election year, Absolute value of the rate of inflation in the 2-year period prior to the election
Source: Fair, R. C. (1988). The effect of economic events on votes for president: 1984 update. Political Behavior, 10, 168-178. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Pricing the C’s of Diamond Stones
Name on CD: Diamond Pricing
n: 308
Variables in Data Set: Carat, Color, Clarity, Certification body, Price in Singapore dollars
Source: Singapore's Business Times, February 18, 2000. Quoted in Chu, S. (2001). Pricing the C's of diamond stones. Journal of Statistics Education, 9(2).

Title of Data Set: Relationship Between Instructor's Evaluation of General Intelligence, Quality of Clothing, and School Standard
Name on CD: Intelligence Clothing Standard
n: 1725
Variables in Data Set: Intelligence rating, Clothing rating, School standard, Number for each category (Dataset includes three partitioning tables)
Source: Gilby, W. H. (1911). On the significance of the teacher's appreciation of general intelligence. Biometrika, VII, 79-93. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Relationship Between STAR Reading and Math and SAT-9 Reading, Math, and Language
Name on CD: STAR
n: 150
Variables in Data Set: Gender, STAR reading scaled score, STAR math scaled score, SAT-9 reading scaled score, SAT-9 math scaled score, SAT-9 language scaled score
Source: Randomly generated data

Title of Data Set: Salary Survey Data of Computer Professionals in a Large Corporation
Name on CD: Salary of Computer Pros
n: 46
Variables in Data Set: Education, Experience, Management responsibility, Salary
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Sample of 200 Observations from SAT-10 Monte Carlo Simulation
Name on CD: Temp 1
n: 200
Variables in Data Set: Reading total score, Reading vocabulary score, Reading comprehension score, Math total score, Math concepts score, Math problem solving score
Source: National Office for Research on Measurement and Evaluation Systems (NORMES), University of Arkansas

Title of Data Set: SAT-10 Monte Carlo Simulation Data
Name on CD: SAT 10 Macro
n: 10,000
Variables in Data Set: Reading total score, Reading vocabulary score, Reading comprehension score, Math total score, Math concepts score, Math problem solving score
Source: National Office for Research on Measurement and Evaluation Systems (NORMES), University of Arkansas

Title of Data Set: Scores for Students Expected to Reach 80% Mastery Criterion on a 45 item Test with 5 Options Per Item
Name on CD: 80% Mastery
n: 50
Variables in Data Set: ID, Score
Source: Randomly generated data based on the binomial distribution; corresponding data set found in Random Guessing.xls

Title of Data Set: Scores for Students with Random Guessing on a 45 Item Test with 5 Options Per Item
Name on CD: Random Guessing
n: 50
Variables in Data Set: ID, Score
Source: Randomly generated data based on the binomial distribution; corresponding data set found in 80% Mastery.xls

Title of Data Set: Scores on a Multiple Choice and Open Response Literacy Exam
Name on CD: Literacy Test
n: 4999
Variables in Data Set: ID, Gender, Race, Free and reduced lunch participation, Performance class, Scaled score, Multiple choice items 1-24, Multiple choice scores for strands 1-3, Total multiple choice score, Open ended scores for strands 1-3, Total open ended score, Total raw score
Source: Randomly generated data

Title of Data Set: Simulated Scores for Grades 3-5 on Arkansas Math Benchmark Exam
Name on CD: Arkansas Math
n: 216
Variables in Data Set: Special services code, Free and reduced price lunch participation, Limited English proficiency classification, Race, Gender, Grade, Math proficiency class, Mobility status, Multiple choice score, Open response score, Total math raw score, Teacher, Multiple choice and open response scores by 5 math strands (Number Sense, Geometry, Measurement, Data Analysis, and Patterns and Algebraic Functions), Total math scaled score
Source: National Office for Research on Measurement and Evaluation Systems (NORMES), University of Arkansas

Title of Data Set: Sleep in Mammals
Name on CD: Animal Sleep
n: 62
Variables in Data Set: Species of animal, Body weight, Brain weight, Slow wave ("nondreaming") sleep, Paradoxical ("dreaming") sleep, Total sleep, Maximum life span, Gestation time, Predation index, Sleep exposure index, Overall danger index
Source: Allison, T., & Cicchetti, D. V. (1976). Sleep in mammals: Ecological and constitutional correlates. Science, 194, 732-734.

Title of Data Set: State Expenditures on Education
Name on CD: State Education Expenditures
n: 50
Variables in Data Set: State, Number of residents per thousand living in urban areas in 1970, Per capita expenditure on education projected for 1975, Per capita income in 1973, Number of residents per thousand under 18 years of age in 1974, Geographic region
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: The Return on Stocks in Over the Counter Market and New York Stock Exchange, May 9-May 13, 1994
Name on CD: NYSE OTC
n: 30
Variables in Data Set: Weekly return of NASDAQ stocks, Weekly return of NYSE stocks
Source: Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Time of Birth, Sex, and Weight of 44 Babies Born in One Hospital in a 24 Hour Period
Name on CD: Baby Boom
n: 44
Variables in Data Set: Time of birth, Sex, Birth Weight, Minutes after midnight of birth
Source: Brisbane Sunday Mail, Dec. 21, 1997. Quoted in Dunn, P. (1999). A simple dataset for demonstrating common distributions. Journal of Statistics Education, 7(3).

Title of Data Set: U.S. Airport Statistics
Name on CD: Airports
n: 135
Variables in Data Set: Airport, City, Scheduled departures, Performed departures, Enplaned passengers, Enplaned revenue tons of freight, Enplaned revenue tons of mail
Source: U.S. Federal Aviation Administration and Research and Special Programs Administration, 'Airport Activity Statistics' (1990). Submitted to the Journal of Statistics Education by Larry Winner.
Title of Data Set: U.S. Senate Votes for Clinton Removal
Name on CD: Impeachment
n: 100
Variables in Data Set: Name of Senator, State of Senator, Vote on Article I, Vote on Article II, Number of votes for guilt, Political party affiliation, Degree of ideological conservativism, Percent of the vote Clinton received in 1996 in the Senator’s state, Year Senator is up for re-election, First-term Senator
Source: http://usatoday.com/news/index/clinton/senvote2.htm, http://www.conservative.org/new_ratings/1997/97senate-preview.htm, http://www.vote-smart.org. Data compiled for the Journal of Statistics Education by Alan Reifman.

Title of Data Set: UK Total Monthly Air Passengers, 1949-1999
Name on CD: Air Passengers
n: 612
Variables in Data Set: Month, Year, Total number of monthly passengers
Source: http://www.bath.ac.uk/~mascc/Grubb.TS

Title of Data Set: Width and Length of Fourth Grade Students’ Feet
Name on CD: Kid’s Feet
n: 39
Variables in Data Set: Birth month, Birth year, Length of longer foot, Width of longer foot, Gender, Foot measured, Left- or right-handedness
Source: Meyer, M. C. (2006). Wider shoes for wider feet? Journal of Statistics Education, 14(1). Data collected by the author in a fourth grade classroom in Ann Arbor, MI.

Title of Data Set: Wind Chill Factor: Windspeed and Temperature
Name on CD: Wind Chill
n: 120
Variables in Data Set: Actual air temperature, Wind speed, Wind chill factor (Variables presented in list and matrix format)
Source: National Weather Service; Museum of Science of Boston. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Yearly Employment Rates in the U.S. of 25- to 34-Year Old Males with 9-11 Years of Schooling
Name on CD: Percent Employed
n: 20
Variables in Data Set: Year, Percent of males employed
Source: The Condition of Education (1991). U.S. Department of Education. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Yield (%) on British short term government securities in successive months from about 1950 to about 1971
Name on CD: Government Securities
n: 240
Variables in Data Set: Year, Yield per month (January-December)
Source:
http://www.bath.ac.uk/~mascc/yield.TS

Title of Data Set: Yields from Vineyard Harvest by Row Number and Year of Harvest, 1983-1991
Name on CD: Harvest Yield
n: 468
Variables in Data Set: Harvest year, Row of vines, Yield of grapes
Source: Barnhill family archives, 1976-1991. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Sample Monte Carlo Simulation Program

   /* Means, standard deviations, sample sizes, and the lower triangle */
   /* of the correlation matrix for the six SAT-10 scores (v1-v6).     */
   data corr1(type= corr);
      infile cards missover;
      input _type_ $ _name_ $ v1-v6;
      cards;
   mean .  668.4 680.2 663.3 668.6 666.2 672.2
   std  .  39.1  48.8  39.1  37.9  37.6  48.1
   n    .  15000 15000 15000 15000 15000 15000
   corr v1 1.00
   corr v2 .91 1.00
   corr v3 .96 .78 1.00
   corr v4 .71 .65 .68 1.00
   corr v5 .69 .64 .66 .95 1.00
   corr v6 .64 .57 .62 .93 .77 1.00
   ;
   run;

   /* Extract six factors; the OUTSTAT= data set t1 is read by PROC IML below. */
   proc factor data=corr1 nfact=6 outstat=t1 noprint;
      var v1-v6;
   run;

   title "Simulation Data for Classroom Models";

   proc iml;
      start sim1;
         use work.t1;
         read all var {v1 v2 v3 v4 v5 v6} into x12;
         n= 10000;                                      /* number of simulated cases */
         x11= {668.4 680.2 663.3 668.6 666.2 672.2};    /* target means */
         xx12= {39.1 48.8 39.1 37.9 37.6 48.1};         /* target standard deviations */
         g11= x12[13:18,]`;         /* rows 13-18 of t1 (taken to be the factor loadings), transposed */
         a1= rannor(j(n, 6, 1));    /* n x 6 matrix of standard normal deviates, seed 1 */
         a1_t= t(a1);
         s_hat= g11*a1_t;           /* impose the correlation structure */
         stand= t(s_hat);           /* n x 6 matrix of correlated standard scores */
         m1= x11[1,1]; m2= x11[1,2]; m3= x11[1,3];
         m4= x11[1,4]; m5= x11[1,5]; m6= x11[1,6];
         s1= xx12[1,1]; s2= xx12[1,2]; s3= xx12[1,3];
         s4= xx12[1,4]; s5= xx12[1,5]; s6= xx12[1,6];
         /* rescale each standard score to its target mean and standard deviation */
         col_g1= m1 + s1*stand[,1];
         col_g2= m2 + s2*stand[,2];
         col_g3= m3 + s3*stand[,3];
         col_g4= m4 + s4*stand[,4];
         col_g5= m5 + s5*stand[,5];
         col_g6= m6 + s6*stand[,6];
         n_data= col_g1||col_g2||col_g3||col_g4||col_g5||col_g6;
         create sim1_data from n_data[colname= {x1 x2 x3 x4 x5 x6}];
         append from n_data;
      finish sim1;
      run sim1;

   /* Round the simulated scores to whole numbers. */
   data sample;
      set sim1_data;
      x1= round(x1, 1); x2= round(x2, 1); x3= round(x3, 1);
      x4= round(x4, 1); x5= round(x5, 1); x6= round(x6, 1);
   run;

   /* Check the simulated correlations, then draw a random sample of 200 cases. */
   proc corr data= sample;
   run;

   proc surveyselect data=sample sampsize= 200 out= temp1;
   run;

Appendix A

Example Univariate Output for Arkansas Math.xls

The UNIVARIATE Procedure
Variable: s3MtScSc (Mathematics Scaled Score)

Moments
   N                        216    Sum Weights              216
   Mean              223.013889    Sum Observations       48171
   Std Deviation      88.260106    Variance          7789.84632
   Skewness          -0.3278101    Kurtosis          -0.1406821
   Uncorrected SS      12417619    Corrected SS      1674816.96
   Coeff Variation    39.576058    Std Error Mean    6.00533957

Basic Statistical Measures
   Location                  Variability
   Mean     223.0139         Std Deviation           88.26011
   Median   226.0000         Variance                    7790
   Mode     375.0000         Range                  375.00000
                             Interquartile Range    115.50000

Tests for Location: Mu0=0
   Test           -Statistic-      -----p Value------
   Student's t    t   37.13593     Pr > |t|     <.0001
   Sign           M        107     Pr >= |M|    <.0001
   Signed Rank    S    11502.5     Pr >= |S|    <.0001

Quantiles (Definition 5)
   Quantile       Estimate
   100% Max          375.0
   99%               375.0
   95%               375.0
   90%               341.0
   75% Q3            286.0
   50% Median        226.0
   25% Q1            170.5
   10%               115.0
   5%                 51.0
   1%                  9.0
   0% Min              0.0

Extreme Observations
   ----Lowest----      ----Highest---
   Value    Obs        Value    Obs
       0    159          375    168
       0    113          375    171
       9    146          375    183
      10    133          375    198
      14    134          375    214

Stem-and-leaf display (Multiply Stem.Leaf by 10**+1)
   Stem  Leaf                        #
     36  47255555555555555          17
     34  12671                       5
     32  033684                      6
     30  12477789267788999          17
     28  0035772345899              13
     26  134447778899334578         18
     24  012345612223359            15
     22  00225669999123445677889    23
     20  011268012334577889         18
     18  0001122557788901333489     22
     16  01123678012457899          17
     14  0245799004569              13
     12  02702456                    8
     10  3577                        4
      8  10347                       5
      6  614                         3
      4  718                         3
      2  364                         3
      0  009045                      6

[Boxplot of s3MtScSc, printed alongside the stem-and-leaf display in the original output]

[Normal probability plot of s3MtScSc: vertical axis from 10 to 370, horizontal axis from -2 to +2]

Example Correlation from Literacy Test.xls

The CORR Procedure
3 Variables: Scaled_Score Total_Open_Ended Total_Multiple_Choice

Simple Statistics
   Variable                   N        Mean     Std Dev       Sum     Minimum      Maximum   Label
   Scaled_Score            4999   212.60152    30.56291   1062795    14.00000    379.00000   Scaled Score
   Total_Open_Ended        4999    28.45009    13.00161    142222           0     48.00000   Total_Open_Ended
   Total_Multiple_Choice   4999    30.25805    10.41199    151260           0     48.00000   Total_Multiple_Choice

Pearson Correlation Coefficients, N = 4999
Prob > |r| under H0: Rho=0
                            Scaled_Score   Total_Open_Ended   Total_Multiple_Choice
   Scaled_Score                  1.00000            0.83332                 0.80558
                                                     <.0001                  <.0001
   Total_Open_Ended              0.83332            1.00000                 0.71297
                                  <.0001                                     <.0001
   Total_Multiple_Choice         0.80558            0.71297                 1.00000
                                  <.0001             <.0001

Example ANOVA Output for Literacy Test.xls

The GLM Procedure

Class Level Information
   Class   Levels   Values
   Race         5   African-American, Asian/Pacific Islander, Hispanic, Other, White

   Number of Observations Read   4999
   Number of Observations Used   4999

Dependent Variable: Scaled_Score (Scaled Score)
   Source               DF    Sum of Squares    Mean Square    F Value    Pr > F
   Model                 4        314682.210      78670.553      90.24    <.0001
   Error              4994       4353906.018        871.827
   Corrected Total    4998       4668588.228

   R-Square    Coeff Var    Root MSE    Scaled_Score Mean
   0.067404     13.88829    29.52672             212.6015

   Source    DF       Type I SS    Mean Square    F Value    Pr > F
   Race       4     314682.2102     78670.5526      90.24    <.0001

   Source    DF     Type III SS    Mean Square    F Value    Pr > F
   Race       4     314682.2102     78670.5526      90.24    <.0001

Tukey's Studentized Range (HSD) Test for Scaled_Score
NOTE: This test controls the Type I experimentwise error rate.
   Alpha                                     0.05
   Error Degrees of Freedom                  4994
   Error Mean Square                     871.8274
   Critical Value of Studentized Range    3.85915
Comparisons significant at the 0.05 level are indicated by ***.

                                                      Difference     Simultaneous 95%
   Race Comparison                                 Between Means    Confidence Limits
   Asian/Pacific Islander - White                         3.7908    -7.8015    15.3831
   Asian/Pacific Islander - Other                         6.0878    -8.1354    20.3110
   Asian/Pacific Islander - Hispanic                     18.1487     5.5481    30.7493   ***
   Asian/Pacific Islander - African-American             21.9304    10.1821    33.6786   ***
   White - Asian/Pacific Islander                        -3.7908   -15.3831     7.8015
   White - Other                                          2.2970    -6.1704    10.7644
   White - Hispanic                                      14.3579     9.0501    19.6658   ***
   White - African-American                              18.1396    15.4157    20.8634   ***
   Other - Asian/Pacific Islander                        -6.0878   -20.3110     8.1354
   Other - White                                         -2.2970   -10.7644     6.1704
   Other - Hispanic                                      12.0609     2.2583    21.8636   ***
   Other - African-American                              15.8426     7.1629    24.5223   ***
   Hispanic - Asian/Pacific Islander                    -18.1487   -30.7493    -5.5481   ***
   Hispanic - White                                     -14.3579   -19.6658    -9.0501   ***
   Hispanic - Other                                     -12.0609   -21.8636    -2.2583   ***
   Hispanic - African-American                            3.7817    -1.8587     9.4220
   African-American - Asian/Pacific Islander            -21.9304   -33.6786   -10.1821   ***
   African-American - White                             -18.1396   -20.8634   -15.4157   ***
   African-American - Other                             -15.8426   -24.5223    -7.1629   ***
   African-American - Hispanic                           -3.7817    -9.4220     1.8587

                                       ---------Scaled_Score--------
   Level of Race                 N           Mean         Std Dev
   African-American           1174     199.436968      26.8714035
   Asian/Pacific Islander       49     221.367347      31.5401159
   Hispanic                    247     203.218623      29.1798596
   Other                        93     215.279570      31.5365094
   White                      3436     217.576542      30.3219353
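
As a classroom cross-check on the GLM output above, the ANOVA table and one of the Tukey intervals can be rebuilt from nothing more than the five group counts, means, and standard deviations. The following is a minimal sketch in Python (used here only for illustration; the CD materials themselves are in SAS and Excel). The function and variable names are illustrative and not part of the CD. The sketch assumes the Tukey-Kramer form of the interval for unequal group sizes, and it takes the critical value 3.85915 directly from the printed output rather than computing it.

```python
import math

# (n, mean, standard deviation) for each level of Race, copied from the
# "Level of Race" table in the GLM output for Literacy Test.xls.
groups = {
    "African-American":       (1174, 199.436968, 26.8714035),
    "Asian/Pacific Islander": (49,   221.367347, 31.5401159),
    "Hispanic":               (247,  203.218623, 29.1798596),
    "Other":                  (93,   215.279570, 31.5365094),
    "White":                  (3436, 217.576542, 30.3219353),
}


def one_way_anova(groups):
    """Rebuild a one-way ANOVA table from group summary statistics."""
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total
    # Between-group (model) and within-group (error) sums of squares.
    ss_model = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
    ss_error = sum((n - 1) * s ** 2 for n, _, s in groups.values())
    df_model = len(groups) - 1
    df_error = n_total - len(groups)
    ms_error = ss_error / df_error
    return {
        "df_model": df_model,
        "df_error": df_error,
        "ss_model": ss_model,
        "ss_error": ss_error,
        "ms_error": ms_error,
        "f_value": (ss_model / df_model) / ms_error,
        "r_square": ss_model / (ss_model + ss_error),
    }


def tukey_kramer_limits(groups, a, b, ms_error, q_crit):
    """Difference in means (a minus b) with Tukey-Kramer confidence limits."""
    n_a, mean_a, _ = groups[a]
    n_b, mean_b, _ = groups[b]
    diff = mean_a - mean_b
    half_width = q_crit * math.sqrt(ms_error / 2 * (1 / n_a + 1 / n_b))
    return diff, diff - half_width, diff + half_width


anova = one_way_anova(groups)

# Assumption: q_crit is the "Critical Value of Studentized Range" printed
# in the SAS output; it is not derived here.
diff, lower, upper = tukey_kramer_limits(
    groups, "Asian/Pacific Islander", "White", anova["ms_error"], 3.85915
)

print(f"F = {anova['f_value']:.2f}, R-square = {anova['r_square']:.6f}")
print(f"Asian/Pacific Islander - White: {diff:.4f} ({lower:.4f}, {upper:.4f})")
```

Students can compare the printed values with the F Value, R-Square, and first Tukey row in the output above. Small differences in the last decimal place are expected because the group summaries are themselves rounded.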