Data Sets for Use in Statistic, Measurement and Design Courses
Charles Stegman, Calli Holaway-Johnson, Sean Mulvenon, Sarah McKenzie,
Ronna Turner, and Karen Morton
University of Arkansas
Paper presented at the Joint Statistical Meeting of the American Statistical Association,
International Biometric Society, Institute of Mathematical Statistics,
and Statistical Society of Canada
Seattle, Washington
August 2006
Abstract
A major focus in teaching graduate level courses in statistics, measurement, and design should be
the analysis of data. Results can be used to illustrate key concepts underlying the procedures
discussed, help students learn how to analyze theoretical data in preparation for their careers, aid
in interpreting and presenting research results, and contribute to preparing future researchers.
This paper presents information on a multitude of data sets applicable for teaching courses at
multiple levels and the accompanying CD contains the actual datasets.
Background
It is common for textbooks in statistics and research methodology to include a disk with several
datasets that are used throughout the text. Glass and Hopkins (1996) is a good example,
although others could be mentioned. Textbook datasets are commonly limited in terms of the
number of datasets included and the number of cases within each dataset.
The CD produced for this paper contains over 100 datasets from multiple fields, as well as
Monte Carlo computer generated datasets. In addition, the datasets can be used across a range of
courses from the introduction to research methodology and statistics through regression,
ANOVA, multivariate, and advanced measurement.
Development of the CD
A first step was to locate publicly accessible datasets available on the web. These are datasets
that can be downloaded and used in teaching so long as appropriate acknowledgement is given.
For example, many researchers and professors have made their datasets available for public use
through the StatLib library at Carnegie-Mellon University [http://lib.stat.cmu]. Four other
helpful sites are the National Institute of Standards & Technology website
[www.itl.nist.gov/div898/strd/general/dataarchive.html], the UCLA Statistics Lab website
[www.ats.ucla.edu/stat], the Journal of Statistics Education Data Archive
[www.amstat.org/publications/jse/jse_data_archive.html], and the DataFerrett
[www.thedataweb.org]. The NIST site contains datasets that can be used to test or demonstrate the
accuracy and precision of different computer packages when analyzing statistical data. The
UCLA site contains a wealth of statistical information and sample programs. The JSE Data
Archive contains datasets that have been submitted by researchers around the world, and
includes articles utilizing the datasets if available. The DataFerrett allows you to search multiple
topics through data mining technology and select variables for different analyses.
For the CD, selected datasets have been collected from these sites, with each dataset reviewed
and included because it relates to topics regularly used as examples in statistics and research
methodology courses. The datasets represent data from many fields of study, as do the
examples in many of the textbooks. While professors and students can access any of these
public domain datasets, the advantage of collecting them on a CD is that they are put into a
standard format (Excel) and made readily available for uploading into numerous statistical
packages. This should facilitate their use by multiple users in a variety of courses. Each dataset
includes variable descriptions as well as the bibliographic information from the original source.
Additionally, samples from large scale datasets based on government sponsored research have
been generated to support substantively based educational research examples. For example, census
data and other government sponsored large scale research have produced datasets, such as the
Early Childhood Longitudinal Study (ECLS-K), the National Longitudinal Study of Youth
(NLSY), the National Household Education Survey (NHES), and the National Education
Longitudinal Study (NELS). DataFerrett can also be used to access large scale databases. The
following are some of the topics that are available from DataFerrett: Health Care, Child School
Enrollment, Computer Ownership & Uses, Voting & Registration, Race & Ethnicity, School
Enrollment, Teenage Attitudes & Practices, and Library Use. Note that DataFerrett allows you to
search these and many more topics and select the variable sets you want.
A third area where datasets have been generated is through Monte Carlo procedures. By
specifying population parameters, we generated datasets that reflect educational settings and
illustrate important statistical properties. Multivariate data are also generated that can be used in
a number of ways. For instance, variables can be selected for analysis in introductory courses and
then revisited in more advanced courses like regression, design and multivariate statistics.
The Structure of the CD
Table 1 contains a list of the datasets contained on the CD. The title of each dataset is provided,
as well as its name on the CD. The sample size and variables are also included. Finally, the
original source for the data is given.
Insert Table 1
The datasets have been reformatted into Excel files. Many of the original files were in different
formats and, while statisticians are adept at handling these, many students may still be learning
basic data management. Especially in introductory classes, the emphasis is on data analyses
using programs like SAS, SPSS, or R. Having the Excel files allows instructors the opportunity
to write one set of instructions for importing data, allowing more time to concentrate on
statistical analyses. The exception is the large scale datasets from the national databases which
would be applicable to more advanced classes. Given the size of the datasets and the need for
the weighting factors, Excel was too limiting. In this case, dBase and SAS data files were
created.
In more advanced classes, students could be expected to find, import, and clean data from the
original sources. They could then run the same analyses on their own version and on the CD
version to make sure they get the same answers.
Example of Using Some of the Datasets
The dataset (Arkansas Math.xls) is based on simulated student data for grades 3-5 on the
Arkansas Benchmark Mathematics Examination. The Arkansas Benchmark is a criterion-referenced
examination that consists of both multiple-choice and open-response questions. Tests
for each grade level are developed to reflect content identified in the Arkansas state frameworks.
The multiple-choice and open-response sections are weighted equally in determining a student’s
score. In addition to their reported scaled scores, students are categorized as Below Basic, Basic,
Proficient, or Advanced. Students with scaled scores of 200 or above are considered to be
proficient, and those with scores above 250 are considered to be advanced. The dataset contains 216 observations
on 19 variables that would be available to school personnel. The observations were generated to
reflect the actual variables used by the State of Arkansas for No Child Left Behind (NCLB)
school assessments.
Some of the ways we have used the Arkansas Math dataset include the following: the scaled
scores can be used to demonstrate graphs (frequency distribution, frequency polygon, box plot
and stem and leaf), measures of central tendency, variability, skewness, kurtosis and normality.
Similarly, we have used the grade, gender and teacher variables to create subgroups for the same
type of analyses. Several of the categorical variables are analyzed as well (demographics,
crosstabs, and percentages). This is the material in the first five or six chapters in the
introductory course. Students are required to create tables and figures using APA formats to help
them in writing reports or articles.
The Arkansas Math dataset is also used to demonstrate a multitude of different statistical
inferential procedures. You can select data for t-tests, ANOVA (one-way and factorials), model
assumptions, multiple comparisons, effect sizes, correlation, regression, and chi-square analyses.
The multiple choice and open response scores as well as the strand scores reflect multivariate
data.
Another generated education dataset is Literacy Test.xls. This dataset was created to reflect data
that would be available on many state criterion referenced tests that are given at different grade
levels. It differs from the previous example in a couple of important ways. First, it is a larger
dataset (5000 observations) and second, it includes individual student item scores tied to three
strands that might be typical on a Literacy examination. The strands in this example are content,
literacy, and practical. Each strand has 8 multiple-choice items (worth 2 points each) and an
open-response item worth 16 points. Students receive a scaled score based on the points earned
on the literacy items plus their response to a writing prompt. Other variables include gender,
race, and free and reduced lunch participation. The same type of analyses mentioned above can
be demonstrated with the dataset, but by having item data, a number of advanced measurement
issues can also be discussed.
A third example involves the two datasets based on the binomial distribution (Random
Guessing.xls, 80% Mastery.xls). These datasets involve expected performance of 50 students on
examinations worth 40 points. The first set assumes guessing and the second set involves
“mastery learning.” Note that instructors could actually conduct a class exercise and create the
first dataset by giving students answer sheets to fill out without giving them the questions. The
instructor could have the students “score” their tests with a pre-assigned answer key. The
instructor could also discuss why some national tests involve a correction for guessing. Simple
SAS “proc univariate” analyses show the first distribution is positively skewed (p=0.2), while the
second is negatively skewed (p=0.8). Students could then practice merging the datasets and
demonstrate a bi-modal distribution.
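The two binomial datasets can be approximated with a short simulation. The Python sketch below is not the procedure used to build the CD files; it assumes 40 one-point items, 50 students, and success probabilities of 0.2 and 0.8 as described above. The function names and the seed are illustrative.

```python
import math
import random


def simulate_scores(n_students, n_items, p, rng):
    """Return one total score per student; each item is correct with probability p."""
    return [sum(1 for _ in range(n_items) if rng.random() < p)
            for _ in range(n_students)]


def binomial_skewness(n_items, p):
    """Theoretical skewness of a binomial(n_items, p) distribution."""
    return (1 - 2 * p) / math.sqrt(n_items * p * (1 - p))


def sample_skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(2006)  # arbitrary seed, used only for reproducibility
guessing = simulate_scores(50, 40, 0.2, rng)
mastery = simulate_scores(50, 40, 0.8, rng)
merged = guessing + mastery  # two clusters, centered near 8 and near 32 points

print(round(binomial_skewness(40, 0.2), 3))  # 0.237
print(round(binomial_skewness(40, 0.8), 3))  # -0.237
print(round(sample_skewness(guessing), 3), round(sample_skewness(mastery), 3))
```

With only 50 simulated students, the sample skewness can differ noticeably from the theoretical value, which is itself a useful point for class discussion.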
A fourth example (Star.xls) is based on student data (sample size is 150) for the STAR Reading
and STAR Math tests given during the first quarter of the school year and the SAT-9 (reading,
literacy, and math) given in the spring. Student gender is also included so that there are six
variables for each student. Instructors can use the data for descriptive statistical purposes as well
as correlation and regression analyses (including the correlation matrix, multiple regression, and
testing for bivariate normality). Note an instructor could also do simple procedures using the
total data set, separate analyses for each gender, test for equality of correlations, parallelism of
regression lines, ANCOVA and MANOVA. One use of such data might be to identify “at-risk”
students and discuss potential interventions that might be used between October and May.
The Diamond Pricing datasets provide an example of how different analyses may require
reformatting of the datasets. With the Diamond Pricing.xls dataset, students may conduct
univariate analyses. With the Diamond Pricing With Dummy Variables.xls dataset, students can
perform more complicated analyses such as multiple regression. One valuable exercise might be
to have students begin with the basic dataset and create the Diamond Pricing With Dummy
Variables.xls dataset by using a statistical package such as SAS, SPSS or R.
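The dummy-coding step in this exercise can be sketched in a few lines of Python. The column names and values below are hypothetical and are not taken from the Diamond Pricing files; the function simply turns one categorical variable with k levels into k - 1 indicator variables, with the omitted level serving as the reference category.

```python
def make_dummies(rows, column, reference):
    """Replace a categorical column with 0/1 indicator columns.

    The reference level gets no indicator, so k levels yield k - 1 dummies.
    """
    levels = sorted({row[column] for row in rows} - {reference})
    coded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        coded.append(new_row)
    return coded


# Hypothetical records; the real file's variable names and values may differ.
diamonds = [
    {"carat": 0.3, "color": "D", "price": 1000},
    {"carat": 0.5, "color": "E", "price": 2000},
    {"carat": 0.7, "color": "F", "price": 3000},
]

coded = make_dummies(diamonds, "color", reference="D")
for row in coded:
    print(row)
```

Each coded row keeps the numeric variables and gains `color_E` and `color_F`; a row at the reference level has zeros in both.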
Certain datasets allow for instructors to demonstrate various statistical concepts. For example,
the Birth To Ten datasets are actual data that illustrate Simpson's paradox. The Baby Boom.xls
dataset allows us to examine a variety of distributions, including binomial, Poisson, and
exponential. These types of datasets can assist students in transitioning from a theoretical
understanding to pragmatic application.
In addition to their use in parametric statistical analyses, many of the datasets lend themselves to
nonparametric analyses. A valuable exercise might be to have students analyze a dataset using
both parametric and nonparametric procedures. The resulting discussion could focus on the
importance of choosing the appropriate statistical analysis, as well as the impact of the violations
of normality assumptions.
Large Scale Datasets
For large scale data analyses we have included the ECLS-K dataset. The Early Childhood
Longitudinal Study – Kindergarten (ECLSK_sample) dataset is a subset of data from the
ECLS-Kindergarten Class of 1998-99 (ECLS-K) Public Use Dataset (http://nces.ed.gov/ecls/) collected
by the National Center for Education Statistics (West, http://nces.ed.gov/ecls/pdf/ksum.pdf).
The complete dataset is available for public use, and is located at the NCES website
along with more detailed User’s Guide information, statistical documentation, and user
resources. The complete dataset includes data on a nationally representative sample of about
21,260 children enrolled in both private and public full-day and partial day kindergarten
programs in the academic year 1998-99. The type of data includes child and parent demographic,
child academic and behavioral, family environment, and classroom and school demographic
variables.
The data file included in this disk is a subset of 97 academic, behavioral, demographic, and
family environment variables (with 6 sample weighting variables and their associated 540
replicate weights) for a total of 643 variables. All 21,260 students are included in the dataset,
thus the ECLSK_sample dataset has the same sampling properties as the original public use
dataset. In the original sampling, oversampling occurred for select subgroups such as Asian
students and students in private kindergarten programs (West, http://nces.ed.gov/ecls/pdf/ksum.pdf).
Thus, weighting variables are necessary for producing data that are representative of
the 1998-99 national population. Additionally, the multi-stage sampling procedure used
probability sampling from within primary sampling units. Because the sampling procedure
allows for correlated samples, the within-group error variance is an underestimate of what would
be found in the population, and subsequently, test statistics computed from the samples will be
inflated. There are two common ways to adjust test statistics computed from the samples: the use
of Design Effects or the use of re-estimation statistical packages such as SUDAAN
(http://www.rti.org/sudaan/) or WesVar (http://www.westat.com/wesvar/). Design effect
estimates can be found in the ECLS-K User’s Guide.
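As a rough illustration of the design-effect approach, the Python sketch below applies the usual adjustment: the design effect (DEFF) is the ratio of the variance under the actual sampling design to the variance under simple random sampling, so a standard error computed as if the sample were a simple random sample is multiplied by the square root of DEFF, and a t statistic is divided by it. The DEFF value used here is hypothetical; actual estimates should be taken from the ECLS-K User’s Guide.

```python
import math


def adjust_standard_error(se_srs, deff):
    """Inflate a simple-random-sampling standard error by sqrt(DEFF)."""
    return se_srs * math.sqrt(deff)


def adjust_t_statistic(t_srs, deff):
    """Deflate a t statistic that was computed under simple random sampling."""
    return t_srs / math.sqrt(deff)


def effective_sample_size(n, deff):
    """Size of the simple random sample that would give the same precision."""
    return n / deff


deff = 4.0  # hypothetical design effect, chosen only to keep the arithmetic simple
print(adjust_standard_error(0.5, deff))     # 1.0
print(adjust_t_statistic(3.0, deff))        # 1.5
print(effective_sample_size(21260, deff))   # 5315.0
```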
The ECLSK_sample data file is recommended for use by students in moderate to advanced
applied research methods and statistics courses; it is not recommended for students in
introductory courses. The format of the variables requires students to utilize recoding procedures
and provides opportunities for students to practice the creation of new variables by combining
multiple related background and/or environmental variables. Weighting can be introduced to the
students through the use of the sampling weights provided in the data file. Additionally, students
can learn about the need for design effects with samples obtained by clustered or multi-stage
sampling procedures and/or the use of jackknifing procedures with selection of the replicate
weights provided.
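The role of the sampling weights and replicate weights can be sketched as follows. This is a generic replication-variance calculation in Python on toy numbers, not the ECLS-K procedure itself: the statistic is recomputed once per replicate weight, and the squared deviations from the full-sample estimate are summed and multiplied by a factor. The factor depends on the replication method and is an assumption here (commonly 1 for a paired jackknife and (R - 1)/R for a delete-one jackknife with R replicates); the User’s Guide should be consulted for the appropriate value.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of values using the given sampling weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def replicate_variance(values, full_weights, replicate_weights, factor=1.0):
    """Variance of the weighted mean estimated from replicate weights.

    factor is an assumption that depends on the replication scheme.
    """
    full_estimate = weighted_mean(values, full_weights)
    squared_deviations = [
        (weighted_mean(values, rep) - full_estimate) ** 2
        for rep in replicate_weights
    ]
    return factor * sum(squared_deviations)


# Toy data: four cases, one full-sample weight and two replicate weights.
scores = [10, 20, 30, 40]
full = [1, 1, 1, 1]
replicates = [[2, 0, 1, 1], [1, 1, 2, 0]]

estimate = weighted_mean(scores, full)
variance = replicate_variance(scores, full, replicates)
print(estimate)                        # 25.0
print(variance)                        # 12.5
print(round(math.sqrt(variance), 3))   # 3.536
```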
The types of variables allow for a variety of statistical procedures including nonparametric
statistics, multiple regression, analysis of variance, analysis of covariance, and multivariate
analysis of variance procedures. Professors teaching courses that include multiple regression,
multivariate analysis, measurement and evaluation, and large-scale database analysis may find
the data file useful for classroom examples and student practice. Additionally, professors will be
able to create numerous smaller datasets from the data file for classroom use.
Included in the ECLS-K folder are the data file in two formats (a dBase file and a SAS data file;
an Excel file could not be used because of the 256 variable limit), a Microsoft Word file of the
variable codebook, and a SAS file listing the variable labels and format statements. The user will
want to review the ECLS-K User’s Guide for more detailed information on sampling, data
collection, variables, use of weights, design effects, and appropriate variance estimation
procedures. The dBase (.dbf) file is recommended for use in WesVar.
Monte Carlo Simulations
If you have descriptive statistical information for a data set, but don’t actually have the data set, a
very efficient method to help develop a practice or pilot research data set is through the use of
Monte Carlo simulations. In Monte Carlo simulations a researcher uses the descriptive data to
create “parallel” data sets that have the characteristics of the original data set. Further, the
researcher can create an unlimited number of cases and conditions associated with this original
data set.
Monte Carlo simulations have traditionally been used in statistics and other related
fields to evaluate the effectiveness of new methods and procedures. For example, a researcher
who develops a new statistical procedure needs to check it under various conditions, such as
discrepant sample sizes and normal and non-normal distributions. Collecting data or
using archival data sets to evaluate the effectiveness of this new procedure under these various
conditions would take a protracted amount of time. Further, random sampling error in
the archival data sets may also be a problem. Thus, the researcher would use the collected and
archival data sets together with Monte Carlo simulations.
A Monte Carlo simulation using the Stanford Achievement Test, Version 10 (SAT-10) data is
demonstrated. Descriptive information for the SAT-10 7th grade spring administration of the
exam has been selected. The descriptive information needed to conduct this type of Monte Carlo
simulation consists of the means, standard deviations, and correlations among all the variables (see
Table 2). The variables selected for this simulation are Reading Vocabulary, Reading
Comprehension, Reading Total, Math Concepts, Math Problem Solving, and Math Total.
Table 2. Descriptive Statistics for SAT-10 7th Grade Spring Exam
_____________________________________________________________________________
                                           Correlations
Variable                 Mean    Std    V1     V2     V3     V4     V5     V6
_____________________________________________________________________________
Reading:
  Vocabulary (V1)        669.4   39.1   1.00   .      .      .      .      .
  Comprehension (V2)     680.2   48.8   0.91   1.00   .      .      .      .
  Total (V3)             663.3   39.1   0.96   0.78   1.00   .      .      .
Math:
  Concepts (V4)          668.6   37.9   0.71   0.65   0.68   1.00   .      .
  Problem Solving (V5)   666.2   37.6   0.69   0.64   0.66   0.95   1.00   .
  Total (V6)             672.2   48.1   0.64   0.57   0.62   0.93   0.77   1.00
_____________________________________________________________________________
Using the following sample program written in SAS version 9.2 (see Figure 1), you can complete
a Monte Carlo simulation of the SAT-10 7th grade data provided in Table 2. A data set called
SAT 10 Macro.xls with 10,000 observations, generated using the macro in Figure 1, is
available on the provided CD.
This type of simulation process can also be extremely valuable for use in classroom
environments. The last few lines of SAS code include a procedure called “Proc Surveyselect.”
This procedure can be used to select random subsets of the data from the file SAT 10 Macro.xls.
For this example, we have selected a sample of 200, with the data output to a file called “temp1.”
This file, listed on the CD as Temp 1.xls, contains the 200 observations, randomly selected from
SAT 10 Macro.xls. To confirm the macro is working effectively, the descriptive statistics for
“temp1” are provided in Table 3. A comparison of the descriptive statistics from Table 2 with
Table 3 provides the necessary evidence to confirm that “temp1” is a representative sample of
the SAT-10 7th Grade achievement data.
Using Monte Carlo simulation procedures you can develop individualized data sets for students,
complete pilot research work, or examine results for previous studies under the different
conditions you place on the analyses.
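The general idea can be sketched in Python for readers without SAS. This is not the macro in Figure 1: it simulates only two of the six variables, using the means, standard deviations, and correlation (0.71) reported for V1 and V4 in Table 2, and it assumes bivariate normality. The call to `random.sample` plays a role analogous to a simple random sample drawn with Proc Surveyselect; the function names and seed are illustrative.

```python
import math
import random


def simulate_pair(n, mean_x, sd_x, mean_y, sd_y, r, rng):
    """Draw n (x, y) pairs from a bivariate normal with correlation r."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = mean_x + sd_x * z1
        # Mixing z1 and z2 this way gives y a correlation of r with x.
        y = mean_y + sd_y * (r * z1 + math.sqrt(1.0 - r * r) * z2)
        pairs.append((x, y))
    return pairs


def describe(pairs):
    """Return means, sample standard deviations, and the Pearson correlation."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": math.sqrt(sxx / (n - 1)),
        "sd_y": math.sqrt(syy / (n - 1)),
        "r": sxy / math.sqrt(sxx * syy),
    }


rng = random.Random(10)  # arbitrary seed, used only for reproducibility
population = simulate_pair(10_000, 669.4, 39.1, 668.6, 37.9, 0.71, rng)
sample = rng.sample(population, 200)  # simple random sample without replacement

print(describe(population))
print(describe(sample))
```

Comparing the two printed summaries mirrors the comparison of Table 2 with Table 3: the 200-case sample should reproduce the specified parameters up to sampling error.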
Table 3. Descriptive Statistics for Monte Carlo Sample of 200 for SAT-10 7th Grade Spring Exam
_____________________________________________________________________________
                                           Correlations
Variable                 Mean    Std    V1     V2     V3     V4     V5     V6
_____________________________________________________________________________
Reading:
  Total (V1)             668.5   39.4   1.00   .      .      .      .      .
  Vocabulary (V2)        680.6   49.3   0.91   1.00   .      .      .      .
  Comprehension (V3)     663.3   39.4   0.96   0.78   1.00   .      .      .
Math:
  Total (V4)             668.3   37.9   0.72   0.65   0.69   1.00   .      .
  Concepts (V5)          666.0   37.6   0.70   0.64   0.66   0.95   1.00   .
  Problem Solving (V6)   671.8   48.2   0.64   0.57   0.62   0.93   0.77   1.00
_____________________________________________________________________________
Sample printout from SAS
Examples of some of the SAS printout for selected analyses are included in Appendix A. They
include a univariate analysis, SAS graph, correlation, and an ANOVA. These demonstrate how a
standard statistical program will generate examples for discussion in class.
Conclusion and Distribution
The paper discussed the contents and structure of the CD datasets as well as suggestions for how
some of the datasets can be utilized. The CD is free and you may use it in your teaching. Again,
proper credit must be given to the appropriate source. For instance, at StatLib they use the
statement: “If you use an algorithm, dataset, or other information from StatLib, please
acknowledge both StatLib and the original contributor of the material.” For the NCES datasets
they prefer the following citation: National Center for Education Statistics, U.S. Department of
Education.
We hope these datasets will be helpful as you prepare your courses. We will continue to add
datasets to the CD and will make them available to interested professionals. You may
contact one of the authors at the University of Arkansas.
Table 1. Data Sets for Use in Statistic, Measurement, and Design Courses

Title of Data Set: 1993 New Car Data
Name on CD: 1993 Cars
n: 93
Variables in Data Set: Manufacturer, Model, Type, Minimum price, Midrange price, Maximum price, City MPG, Highway MPG, Air bags standard, Drive train type, Number of cylinders, Engine size, Horsepower, RPM, Engine revolutions per mile, Manual transmission available, Fuel tank capacity, Passenger capacity, Length, Wheelbase, Width, U-turn space, Rear seat room, Luggage capacity, Weight, Domestic manufacturing
Source: Consumer Reports: The 1993 Cars - Annual Auto Issue (April), Yonkers: Consumers Union. PACE New Car & Truck 1993 Buying Guide. Milwaukee: Pace Publications. Quoted in Lock, R. H. (1993). 1993 New Car Data. Journal of Statistics Education, 1(1).

Title of Data Set: 1994 AAUP Faculty Salary Data
Name on CD: AAUP
n: 1161
Variables in Data Set: Federal ID number, College Name, State, Type, Avg. salary (full professors), Avg. salary (associate professors), Avg. salary (assistant professors), Avg. salary (all ranks), Avg. compensation (full professors), Avg. compensation (associate professors), Avg. compensation (assistant professors), Avg. compensation (all ranks), Number of full professors, Number of associate professors, Number of assistant professors, Number of Instructors, Number of faculty (all ranks)
Source: March-April 1994 issue of Academe. Submitted to the Journal of Statistics Education by Robin Lock.

Title of Data Set: 2004 New Car and Truck Data
Name on CD: 2004 Cars
n: 428
Variables in Data Set: Vehicle name, Sports car, SUV, Wagon, Minivan, Pickup, All-wheel drive, Rear-wheel drive, Suggested retail price, Dealer price, Engine size, Number of cylinders, Horsepower, City MPG, Highway MPG, Weight, Wheel base, Length, Width
Source: Kiplinger's Personal Finance, December 2003, vol. 57, no. 12, pp. 104-123, http://www.kiplinger.com. Submitted to the Journal of Statistics Education by Roger W. Johnson.

Title of Data Set: A Dataset That Is 44% Outliers
Name on CD: Outlier
n: 43
Variables in Data Set: President name, Number of days in office
Source: 2001 World Almanac. Quoted in Hayden, R. W. (2005). A dataset that is 44% outliers. Journal of Statistics Education, 13(1).

Title of Data Set: Abortion Opinion Data
Name on CD: Abortion Opinion
n: 2385
Variables in Data Set: Race, Gender, Age, Opinion
Source: Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Absentee and Machine Ballot Votes in Philadelphia Elections
Name on CD: Philadelphia Voting
n: 22
Variables in Data Set: Year of election, District number, Democrat absentee vote in district, Republican absentee vote in district, Democrat machine vote in district, Republican machine vote in district
Source: Orley Ashenfelter. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Advertising Pages and Advertising Revenue in 1986
Name on CD: Advertising Pages
n: 41
Variables in Data Set: Name of publication, Number of advertising pages in hundreds, Advertising revenue in millions of dollars
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Annual Data on Advertising, Promotions, Sales Expenses, and Sales
Name on CD: Advertising
n: 22
Variables in Data Set: Advertising expenditures, Promotion expenditures, Sales expense, Sales, Previous year's advertising expenditures, Previous year's promotion expenditures
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Annual Return Rates in the Stock Market, 1976-1993
Name on CD: Stock Market
n: 18
Variables in Data Set: Year, Standard and Poor’s Index year end value, Vanguard Index Trust 500 Portfolio year end value
Source: Vanguard Market Index Trust 500-Portfolio Annual Report, 1993 (p. 7). Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Attitude Survey Data
Name on CD: Employee Satisfaction
n: 30
Variables in Data Set: Overall rating of job being done by supervisor, Handles employee complaints, Does not allow special privileges, Opportunity to learn new things, Raises based on performances, Too critical of poor performances, Rate of advancing to better jobs
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Average Monthly Air Temperature in Recife, Brazil, 1953-1962
Name on CD: Average Temperature
n: 120
Variables in Data Set: Month, Year, Average air temperature for a given month
Source: http://www.bath.ac.uk/~mascc/Recife.TS

Title of Data Set: Ball Bearing Reliability Data
Name on CD: Ball Bearings
n: 210
Variables in Data Set: Company code, Test number, Year of test, Number of bearings, Load, Number of balls, Diameter, L10, L50, Weibull slope, Bearing type
Source: Lieblein and Zelen (1956). Statistical investigation of the fatigue life of deep-groove ball bearings. Quoted in Caroni (2002). Modeling the reliability of ball bearings. Journal of Statistics Education, 10(3).

Title of Data Set: Baseline Data for Mayo Clinic Trial in Primary Biliary Cirrhosis (PBC) of the Liver
Name on CD: Baseline Cirrhosis
n: 418
Variables in Data Set: ID; Number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986; Death status; Drugs administered; Age; Sex; Presence of ascites; Presence of hepatomegaly; Presence of spiders; Presence of edema; Serum bilirubin; Serum cholesterol; Albumin; Urine copper; Alkaline phosphatase; SGOT; Triglycerides; Platelets; Prothrombin time; Histologic stage of disease
Source: Fleming, T. R., & Harrington, D. P. (1991). Counting processes and survival analysis. New York: Wiley.

Title of Data Set: Betting on Professional Football Results for 1989-1991
Name on CD: NFL
n: 672
Variables in Data Set: Name of favored team, Name of underdog team, Betting result, Day and time of game, Favored team at home or away, Week of season, Year
Source: Compiled by Hal Stern. Submitted to the Statlib facility by Robin Lock. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Birth to Ten Study: An Example of Simpson's Paradox
Name on CD: Birth to Ten A (Note: This data set contains the same information as Birth to Ten B in a different format.)
n: 1590
Variables in Data Set: Medical aid given to mother, Mother traced for 5 year interview, Race, Frequency
Source: Chronic Diseases of Lifestyle Programme at the Medical Research Council in Cape Town, South Africa. Quoted in Morrell, C. H. (1999). Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3).

Title of Data Set: Birth to Ten Study: An Example of Simpson's Paradox
Name on CD: Birth to Ten B (Note: This data set contains the same information as Birth to Ten A in a different format.)
n: 1590
Variables in Data Set: Medical aid given to mother, Mother traced for 5 year interview, Race
Source: Chronic Diseases of Lifestyle Programme at the Medical Research Council in Cape Town, South Africa. Quoted in Morrell, C. H. (1999). Simpson's paradox: An example from a longitudinal study in South Africa. Journal of Statistics Education, 7(3).

Title of Data Set: Building Characteristics and Sales Price
Name on CD: Property Valuation
n: 24
Variables in Data Set: Taxes, Number of bathrooms, Lot size, Living space, Number of garage stalls, Number of rooms, Number of bedrooms, Age of the home, Number of fireplaces, Sale price
Source: Narula, S. C., & Wellington, J. F. (1977). Technometrics, 19 (2). Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Calcium, Inorganic Phosphorus and Alkaline Phosphatase Levels in Elderly Patients
Name on CD: Calcium (Note: This dataset intentionally has errors so that students may practice cleaning data. The cleaned dataset is Calciumgood.)
n: 178
Variables in Data Set: Patient observation number, Age in years, Sex, Alkaline phosphatase international units/liter, Lab name, Calcium mmol/L, Inorganic phosphorus mmol/L, Age group
Source: Boyd, J., Delost, M., and Holcomb, J. (1998). Calcium, phosphorus, and alkaline phosphatase laboratory values of elderly subjects. Clinical Laboratory Science, 11. Quoted in Holcomb, J., and Spalsbury, A. (2005). Journal of Statistics Education, 13(3).

Title of Data Set: Calcium, Inorganic Phosphorus and Alkaline Phosphatase Levels in Elderly Patients--Cleaned Dataset
Name on CD: Calciumgood
n: 178
Variables in Data Set: Patient observation number, Age in years, Sex, Alkaline phosphatase international units/liter, Lab name, Calcium mmol/L, Inorganic phosphorus mmol/L, Age group
Source: Boyd, J., Delost, M., and Holcomb, J. (1998). Calcium, phosphorus, and alkaline phosphatase laboratory values of elderly subjects. Clinical Laboratory Science, 11. Quoted in Holcomb, J., and Spalsbury, A. (2005). Journal of Statistics Education, 13(3).

Title of Data Set: Cigarette Consumption Data by State, 1970
Name on CD: Cigarette Consumption
n: 51
Variables in Data Set: State; Median age; Percentage of people over 25 years of age who had completed high school; Per capita personal income; Percentage of blacks; Percentage of females; Weighted average price of a pack of cigarettes; Number of packs of cigarettes sold on a per capita basis
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Cloud Seeding Data
Name on CD: Cloud Seeding
n: 24
Variables in Data Set: Action, Day number, Seeding suitability, Echo coverage, Prewetness, Echo motion, Amount of rain
Source: Woodley, W. L., Simpson, J., Biondini, R., & Berkeley, J. (1977). Rainfall results 1970-75: Florida area cumulus experiment. Science, 195, 735-42. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Cloud-seeding Experiment in Tasmania Between Mid-1964 and January 1971
Name on CD: Rainfall
n: 108
Variables in Data Set: Period, Seeding status, Season, East target area rainfall, West target area rainfall, North control area rainfall, South control area rainfall, Northwest control area rainfall
Source: Miller, A. J., Shaw, D. E., Veitch, L. G. & Smith, E. J. (1979). Analyzing the results of a cloud-seeding experiment in Tasmania. Communications in Statistics - Theory & Methods, vol. A8(10), 1017-1047.
Title of Data Set: Comparison of Changes in Exchange Rates and Differences in Inflation Rates for Various Countries
Name on CD: Exchange Rates
n: 44
Variables in Data Set: Country name, Change in exchange rate 1975-1990, Change in exchange rate 1985-1990, Change in inflation rates 1975-1990, Change in inflation rates 1985-1990
Source: International Financial Statistics Yearbook. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Comparison of Health Care Spending Across the United States
Name on CD: Health Care Spending
n: 50
Variables in Data Set: State, Census Bureau region of the state, Census Bureau region number, Per capita health spending, Percent of per capita income spent on health
Source: The New York Times. October 15, 1993. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Comparison of Productivity and Quality in Japanese and Non-Japanese Automobile Manufacturing
Name on CD: Japanese Autos
n: 27
Variables in Data Set: Assembly defects per 100 cars, Hours per vehicle, National origin of facility, Assembly defects per 100 cars (non-Japanese origin), Assembly defects per 100 cars (Japanese origin), Hours per vehicle (non-Japanese origin), Hours per vehicle (Japanese origin)
Source: Womack, J. P., Jones, D. T., & Roos, D. (1990). The machine that changed the world. New York: Rawson. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Consumer Expenditure and Money Stock 1952-1956
Name on CD: Consumer Expenditure
n: 20
Variables in Data Set: Quarter, Consumer expenditure, Money stock
Source: Friedman, M., & Meiselman, D. (1963). Commission on money and credit, stabilization policies. Englewood Cliffs, NJ: Prentice Hall. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: County Data from the 2000 Presidential Election in Florida (Excluding Federal Absentee Votes)
Name on CD: Florida Voting 2000
n: 67
Variables in Data Set: County, Type of voting machine used, Column format of ballot, Undervote count, Overvote count, Votes counted for Bush, Gore, Browne, Nader, Harris, Hagelin, Buchanan, McReynolds, Phillips, Moorehead, Chote, McCarthy
Source: http://www.stat.ufl.edu/~presnell/fl2000.txt
Title of Data Set: Data on French Economy; IMPORT Data (Billions of French Francs)
Name on CD: French Economy
n: 18
Variables in Data Set: Year, Imports, Domestic production, Stock formation, Domestic consumption
Source: Malinvaud, E. (1968). Statistical methods in econometrics. Chicago: Rand McNally. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Diameter, Height, and Volume of Black Cherry Trees in Allegheny National Forest, Pennsylvania
Name on CD: Cherry Trees
n: 31
Variables in Data Set: Diameter, Height, Volume
Source: Ryan, T., Joiner, B., & Ryan, B. (1976). Minitab student handbook. North Scituate, MA: Duxbury Press. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Diamond Pricing with Dummy Variables
Name on CD: Diamond Pricing with Dummy Variables
n: 308
Variables in Data Set: Carat, Indicator for color D, Indicator for color E, Indicator for color F, Indicator for color G, Indicator for color H, Indicator for clarity IF, Indicator for clarity VVS1, Indicator for clarity VVS2, Indicator for clarity VS1, Indicator for certification body GIA, Indicator for certification body IGI, Indicator for medium stones, Indicator for large stones, Interaction variable med*carat, Interaction variable large*carat, Carat squared, Price in Singapore dollars, Ln(Price)
Source: Chu, S. (2001). Pricing the C's of diamond stones. Journal of Statistics Education, 9(2).

Title of Data Set: Disposable Income and Ski Sales for Years 1964-1974
Name on CD: Ski Sales 1
n: 40
Variables in Data Set: Quarter, Ski sales, Personal disposable income
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.
Title of Data Set: Disposable Income, Ski Sales, and Seasonal Variables for Years 1964-1974
Name on CD: Ski Sales 2
n: 40
Variables in Data Set: Quarter, Ski sales, Personal disposable income, Season
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Distribution for Males and Females Born in Sweden in 1935
Name on CD: Swedish Birth Dates
n: 12
Variables in Data Set: Month, Number of females born, Number of males born
Source: Cramer, H. (1946). Mathematical methods of statistics. Princeton: Princeton University Press. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Distribution of White Student Enrollment in Nassau County School Districts
Name on CD: White Enrollment
n: 56
Variables in Data Set: District, Proposed legislative district, Total public school enrollment, White student enrollment
Source: Newsday, May 20, 1994. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Dow Jones Industrial Average and the S & P 500 Index Values Weekly From February 1, 1991 to February 25, 1994
Name on CD: Dow Jones
n: 161
Variables in Data Set: Date, Dow Jones Industrial Average at the close of the day, Standard and Poor’s 500 Stock Index at the close of the day
Source: Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Drill Bit Performance Over a Range of Drilling Conditions
Name on CD: Drill Bit Data
n: 31
Variables in Data Set: Speed of rotation, Feed rate, Diameter of drill bit, Axial load on drill bit
Source: M. R. Delozier of Kennametal, Inc., Latrobe, Pennsylvania. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Drug Dosage Retained in Rat Livers
Name on CD: Rat Data
n: 19
Variables in Data Set: Body weight, Liver weight, Relative dose, Percentage of dose retained in liver
Source: Weisberg, S. (1980). Applied Linear Regression. New York: Wiley. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.
Title of Data Set: Early Childhood Longitudinal Study (ECLS-K) Data
Name on CD: ECLSK_sample.sas7bdat
n: 21260
Variables in Data Set: See ECLSK_sample codebook.doc (643 variables available)
Source: National Center for Education Statistics, U.S. Department of Education; accessed at http://nces.ed.gov/

Title of Data Set: Effectiveness of Blast Furnace Slags as Agricultural Liming Materials on Three Soil Types
Name on CD: Agricultural Data
n: 7
Variables in Data Set: Treatment, Soil type, Corn yield
Source: Carter, O. R., Collier, B. L., & Davis, F. L. (1951). Blast furnace slags as agricultural liming materials. Agronomy Journal, 43, 430-433. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Emergency Calls to the New York Auto Club in January 1993 and January 1994
Name on CD: Auto Calls
n: 28
Variables in Data Set: Date, Emergency road service calls answered, Forecast high temperature, Forecast low temperature, Daily high temperature, Daily low temperature, Rain forecast, Snow forecast, Type of day, Year, Sunday, Subzero temperature
Source: New York Motorist. (March 1994). Automobile Club of New York. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Equal Educational Opportunity (EEO) Data; Standardized Indexes
Name on CD: EEO Data
n: 70
Variables in Data Set: Family, Peer, School Achievement
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Eruption Durations and Intereruption Times for the "Old Faithful" Geyser in Yellowstone National Park
Name on CD: Old Faithful
n: 222
Variables in Data Set: Date, Duration of eruption, Time until next eruption
Source: Weisberg, S. (1985). Applied linear regression (2nd ed.). New York: John Wiley. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.
Title of Data Set: Excretion of Steroids in Patients with Cushing's Syndrome
Name on CD: Cushing’s Syndrome
n: 21
Variables in Data Set: Type of Cushing’s syndrome, Levels of tetrahydrocortisone, Levels of pregnanetriol
Source: Aitchison, J., & Dunsmore, I. R. (1975). Statistical prediction analysis. Cambridge: Cambridge University Press. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Financial Ratios of Solvent and Bankrupt Firms
Name on CD: Financial Ratios
n: 66
Variables in Data Set: (working capital)/(total assets), (retained earnings)/(total assets), (earnings before interest and taxes)/(total assets), (market-value equity)/(book value of total liabilities), sales/(total assets), bankruptcy status
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Forced Expiratory Volume of Smokers and Non-smokers
Name on CD: FEV
n: 654
Variables in Data Set: Age, Forced Expiratory Volume (FEV), Height, Sex, Smoking status
Source: Rosner, B. (1999), Fundamentals of Biostatistics, 5th Ed., Pacific Grove, CA: Duxbury. Quoted in Kahn, M. (2005). An exhalent problem for teaching statistics. Journal of Statistics Education, 13(2).

Title of Data Set: Fuel Consumption and Automotive Variables
Name on CD: Fuel Consumption
n: 30
Variables in Data Set: Miles/gallon, Displacement, Horsepower, Torque, Compression ratio, Rear axle ratio, Carburetor (barrels), Number of transmission speeds, Overall length, Width, Weight, Type of transmission
Source: Motor Trend magazine, 1975. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Gesell Adaptive Score and Age at First Word
Name on CD: First Word
n: 21
Variables in Data Set: Age at first word, Gesell adaptive score
Source: Mickey, M. R., Dunn, O. J., & Clark, V. (1967). Note on the use of stepwise regression in detecting outliers. Computers & Biomedical Research, 1, 105-9. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.
Title of Data Set: Graduate Admissions at Berkeley
Name on CD: Berkeley Graduate Admissions
n: 4526
Variables in Data Set: Department, Gender, Admission status
Source: Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187, 398-404. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Jet Fighter Data
Name on CD: Jet Fighter
n: 22
Variables in Data Set: Aircraft ID, First flight date, Specific power, Flight range factor, Payload, Sustained load factor, Carrier capability
Source: Stanley, W., & Miller, M. (1979). Measuring technological change in jet fighter aircraft. Report No. R-2249-AF. Santa Monica: Rand Corp. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Lead Rating and News Rating of Television Data
Name on CD: Television Ratings
n: 30
Variables in Data Set: Lead rating, News rating
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Length of Computer Service Calls and Number of Units Repaired
Name on CD: Service Calls 1
n: 14
Variables in Data Set: Units, Minutes
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Length of Computer Service Calls and Number of Units Repaired--Expanded Sample
Name on CD: Service Calls 2
n: 24
Variables in Data Set: Units, Minutes
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.
Title of Data Set: Length of Visits to msnbc.com on September 28, 1999
Name on CD: msnbclength
n: 50,000
Variables in Data Set: Length of visit
Source: Internet Information Server logs for msnbc.com and news-related portions of msn.com. Quoted by Sanchez, J. and He, Y. (2005). Internet data analysis for the undergraduate statistics curriculum. Journal of Statistics Education, 13(3).

Title of Data Set: Leukemia Data for Patients Diagnosed as AG Positive
Name on CD: Leukemia Data AG Positive
n: 17
Variables in Data Set: White blood cell count, Survival time
Source: Feigl, P., & Zelen, M. (1965). Estimation of exponential probabilities with concomitant information. Biometrics, 21, 826-838. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Leukemia Data for Patients Diagnosed as AG Positive or AG Negative
Name on CD: Leukemia Data
n: 30
Variables in Data Set: White blood cell count, AG status, Number of patients surviving at least 52 weeks, Number of patients in each combination of WBC and AG
Source: Feigl, P., & Zelen, M. (1965). Estimation of exponential probabilities with concomitant information. Biometrics, 21, 826-838. Quoted in Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.

Title of Data Set: Los Angeles Heart Study Data
Name on CD: Chapman Data
n: 200
Variables in Data Set: Age, Systolic blood pressure, Diastolic blood pressure, Cholesterol, Height, Weight, Coronary incident
Source: Dixon, W. J., & Massey, F. J., Jr. (1983). Introduction to statistical analysis. New York: McGraw-Hill. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Lug Counts from Vineyard Harvest by Row and Year of Harvest
Name on CD: Lug Counts
n: 52
Variables in Data Set: Row number, Number of lugs for 1983, Number of lugs for 1984, Number of lugs for 1985, Number of lugs for 1986, Number of lugs for 1987, Number of lugs for 1988, Number of lugs for 1989, Number of lugs for 1990, Number of lugs for 1991
Source: Barnhill family archives, 1976-1991. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.
Title of Data Set: Major League Baseball Hall of Fame
Name on CD: MLBHOF
n: 1340
Variables in Data Set: Player name, Number of seasons played, Games played, Official at-bats, Runs scored, Hits, Doubles, Triples, Home runs, Runs batted in, Walks, Strikeouts, Career batting average, On base percentage, Slugging percentage, Adjusted production, Batting runs, Adjusted batting runs, Runs created, Stolen bases, Times caught stealing, Stolen base runs, Fielding average, Fielding runs, Primary position played, Total player rating, Hall of Fame Status
Source: The Baseball Encyclopedia and Total Baseball. Quoted in Cochran, J. (2000). Career records for all modern position players eligible for the Major League Baseball Hall of Fame. Journal of Statistics Education, 8(2).

Title of Data Set: Mayo Clinic Trial in Primary Biliary Cirrhosis (PBC) of the Liver, 1974-1984
Name on CD: Cirrhosis
n: 312 (data given for 1945 visits)
Variables in Data Set: ID; Number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986; Death status; Drugs administered; Age; Sex; Number of days between enrollment and this visit date; Presence of ascites; Presence of hepatomegaly; Presence of spiders; Presence of edema; Serum bilirubin; Serum cholesterol; Albumin; Alkaline phosphatase; SGOT; Platelets; Prothrombin time; Histologic stage of disease
Source: Fleming, T. R., & Harrington, D. P. (1991). Counting processes and survival analysis. New York: Wiley.

Title of Data Set: Monte Carlo Simulation
Name on CD: Sample Monte Carlo Simulation Program.doc
n: 10,000
Variables in Data Set: 7th Grade SAT-10: Reading vocabulary, Reading comprehension, Reading total, Math concepts, Math problem solving, and Math total
Source: Simulated data based on SAT-10 means, standard deviations, and correlations
Title of Data Set: Monthly Domestic Electricity Consumption at Different Temperatures
Name on CD: Electricity
n: 55
Variables in Data Set: Month of observation, Year of observation, Average daily usage, Average daily temperature
Source: Handcock family archives, August 1989-February 1994. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Monthly Sunspots Numbers from 1740 to 1983
Name on CD: Sunspots
n: 2820
Variables in Data Set: Year, Number of sunspots per month (January-December)
Source: http://www.bath.ac.uk/~mascc/sunspots.TS

Title of Data Set: Number of Deaths by Horsekicks in the Prussian Army from 1875-1894 for 14 Corps
Name on CD: Horsekick Deaths
n: 20
Variables in Data Set: Year, Corp1-Corp14, Total
Source: Andrews, D. F., & Herzberg, A. M. (1985). Data. Springer-Verlag: New York. Accessed at Statlib, http://lib.stat.cmu.edu/datasets/Andrews/

Title of Data Set: Number of Supervised Workers and Supervisors in 27 Industrial Establishments
Name on CD: Number of Supervised Workers
n: 27
Variables in Data Set: Number of supervised workers, Number of supervisors
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Number of Surviving Bacteria Following Exposure to 200-Kilovolt X-rays at 6-minute Intervals
Name on CD: Bacteria Death Rates
n: 15
Variables in Data Set: Interval, Number of bacteria
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Numbers of Reported Sexual Partners of a Sample of Males and Females
Name on CD: Sexual Partners
n: 3533
Variables in Data Set: Male, Female
Source: The general social survey, 1989-1991. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.
Title of Data Set: Occupations of Family Heads for Families of Various Religious Groups
Name on CD: Religion and Occupation
n: 3966
Variables in Data Set: Religious affiliation, Occupation, Number for each category
Source: Lazerwitz, B. (1961). A comparison of major United States religious groups. Journal of the American Statistical Association, 56, 568-579. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Perceptions of the New York City Subway System
Name on CD: New York Subway
n: 62
Variables in Data Set: Usage of subway, Cleanliness of stations, Cleanliness of trains, Safety in station, Safety on trains, Rush hour crowding in stations, Rush hour crowding on trains, In-station information, On-train announcements, Convenience of train stops, Convenience of train schedule, Speed of travel, Frequency of trains, Ease of token purchase, Ease of token collection, Police presence in stations, Police presence on trains, Availability of maps, Number of uses per week
Source: Survey conducted at the Leonard N. Stern School of Business, Spring 1994. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Performance of National Basketball Association Guards
Name on CD: NBA
n: 105
Variables in Data Set: Player’s name, Player’s height, Number of games appeared in, Total minutes played, Player’s age, Points scored per game, Assists per game, Rebounds per game, Percent of field goals made, Percent of free throws made
Source: Cohn, J. (1994). The pro basketball bible. San Diego: Basketball Books Ltd. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Presidential Election Data, 1916-1988
Name on CD: Election
n: 19
Variables in Data Set: Year, Democratic share of the two-party vote, Party of incumbent, Party of incumbent running for election, Growth rate of real per capita GNP in the second and third quarters of the election year, Absolute value of the rate of inflation in the 2-year period prior to the election
Source: Fair, R. C. (1988). The effect of economic events on votes for president: 1984 update. Political Behavior, 10, 168-178. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.
Title of Data Set: Pricing the C’s of Diamond Stones
Name on CD: Diamond Pricing
n: 308
Variables in Data Set: Carat, Color, Clarity, Certification body, Price in Singapore dollars
Source: Singapore's Business Times, February 18, 2000. Quoted in Chu, S. (2001). Pricing the C's of diamond stones. Journal of Statistics Education, 9(2).

Title of Data Set: Relationship Between Instructor's Evaluation of General Intelligence, Quality of Clothing, and School Standard
Name on CD: Intelligence Clothing Standard
n: 1725
Variables in Data Set: Intelligence rating, Clothing rating, School standard, Number for each category (Dataset includes three partitioning tables)
Source: Gilby, W. H. (1911). On the significance of the teacher's appreciation of general intelligence. Biometrika, VII, 79-93. Quoted in Christensen, R. (1990). Log-linear models. New York: Springer-Verlag.

Title of Data Set: Relationship Between STAR Reading and Math and SAT-9 Reading, Math, and Language
Name on CD: STAR
n: 150
Variables in Data Set: Gender, STAR reading scaled score, STAR math scaled score, SAT-9 reading scaled score, SAT-9 math scaled score, SAT-9 language scaled score
Source: Randomly generated data

Title of Data Set: Salary Survey Data of Computer Professionals in a Large Corporation
Name on CD: Salary of Computer Pros
n: 46
Variables in Data Set: Education, Experience, Management responsibility, Salary
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Sample of 200 Observations from SAT-10 Monte Carlo Simulation
Name on CD: Temp 1
n: 200
Variables in Data Set: Reading total score, Reading vocabulary score, Reading comprehension score, Math total score, Math concepts score, Math problem solving score
Source: National Office for Research on Measurement and Evaluation Systems (NORMES), University of Arkansas

Title of Data Set: SAT-10 Monte Carlo Simulation Data
Name on CD: SAT 10 Macro
n: 10,000
Variables in Data Set: Reading total score, Reading vocabulary score, Reading comprehension score, Math total score, Math concepts score, Math problem solving score
Source: National Office for Research on Measurement and Evaluation Systems (NORMES), University of Arkansas
Title of Data Set: Scores for Students Expected to Reach 80% Mastery Criterion on a 45 item Test with 5 Options Per Item
Name on CD: 80% Mastery
n: 50
Variables in Data Set: ID, Score
Source: Randomly generated data based on the binomial distribution; corresponding data set found in Random Guessing.xls

Title of Data Set: Scores for Students with Random Guessing on a 45 Item Test with 5 Options Per Item
Name on CD: Random Guessing
n: 50
Variables in Data Set: ID, Score
Source: Randomly generated data based on the binomial distribution; corresponding data set found in 80% Mastery.xls

Title of Data Set: Scores on a Multiple Choice and Open Response Literacy Exam
Name on CD: Literacy Test
n: 4999
Variables in Data Set: ID, Gender, Race, Free and reduced lunch participation, Performance class, Scaled score, Multiple choice items 1-24, Multiple choice scores for strands 1-3, Total multiple choice score, Open ended scores for strands 1-3, Total open ended score, Total raw score
Source: Randomly generated data

Title of Data Set: Simulated Scores for Grades 3-5 on Arkansas Math Benchmark Exam
Name on CD: Arkansas Math
n: 216
Variables in Data Set: Special services code, Free and reduced price lunch participation, Limited English proficiency classification, Race, Gender, Grade, Math proficiency class, Mobility status, Multiple choice score, Open response score, Total math raw score, Teacher, Multiple choice and open response scores by 5 math strands (Number Sense, Geometry, Measurement, Data Analysis, and Patterns and Algebraic Functions), Total math scaled score
Source: National Office for Research on Measurement and Evaluation Systems (NORMES), University of Arkansas
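
The two 45-item score files above are described only as randomly generated from the binomial distribution. The following is a minimal sketch of how scores of that kind could be produced in SAS; it is not the program used to build the files on the CD. The success probabilities (0.20 for guessing among 5 options, 0.80 for the mastery group), the seed, and the names guess_sketch, guess_score, and mastery_score are assumptions made for illustration.

```sas
/* Sketch only: binomial test scores for a 45-item, 5-option test.          */
/* Probabilities, seed, and all names are illustrative assumptions.         */
data guess_sketch;
  call streaminit(45);                             /* arbitrary seed for reproducibility */
  do id = 1 to 50;                                 /* 50 simulated students, as in the files above */
    guess_score   = rand("Binomial", 0.20, 45);    /* pure guessing: 1 chance in 5 per item */
    mastery_score = rand("Binomial", 0.80, 45);    /* assumed 80% chance of success per item */
    output;
  end;
run;

/* Compare the simulated scores with the theoretical values */
proc means data=guess_sketch mean std min max;
  var guess_score mastery_score;
run;
```

Under these assumptions the theoretical means are 45 x 0.20 = 9 and 45 x 0.80 = 36, and both score distributions share the theoretical standard deviation sqrt(45 x 0.20 x 0.80), about 2.68, which students can compare with the PROC MEANS results.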
Title of Data Set: Sleep in Mammals
Name on CD: Animal Sleep
n: 62
Variables in Data Set: Species of animal, Body weight, Brain weight, Slow wave ("nondreaming") sleep, Paradoxical ("dreaming") sleep, Total sleep, Maximum life span, Gestation time, Predation index, Sleep exposure index, Overall danger index
Source: Allison, T., & Cicchetti, D. V. (1976). Sleep in mammals: Ecological and constitutional correlates. Science, 194, 732-734.

Title of Data Set: State Expenditures on Education
Name on CD: State Education Expenditures
n: 50
Variables in Data Set: State, Number of residents per thousand living in urban areas in 1970, Per capita expenditure on education projected for 1975, Per capita income in 1973, Number of residents per thousand under 18 years of age in 1974, Geographic region
Source: Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: The Return on Stocks in Over the Counter Market and New York Stock Exchange, May 9-May 13, 1994
Name on CD: NYSE OTC
n: 30
Variables in Data Set: Weekly return of NASDAQ stocks, Weekly return of NYSE stocks
Source: Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Time of Birth, Sex, and Weight of 44 Babies Born in One Hospital in a 24 Hour Period
Name on CD: Baby Boom
n: 44
Variables in Data Set: Time of birth, Sex, Birth Weight, Minutes after midnight of birth
Source: Brisbane Sunday Mail, Dec. 21, 1997. Quoted in Dunn, P. (1999). A simple dataset for demonstrating common distributions. Journal of Statistics Education, 7(3).

Title of Data Set: U.S. Airport Statistics
Name on CD: Airports
n: 135
Variables in Data Set: Airport, City, Scheduled departures, Performed departures, Enplaned passengers, Enplaned revenue tons of freight, Enplaned revenue tons of mail
Source: U.S. Federal Aviation Administration and Research and Special Programs Administration, 'Airport Activity Statistics' (1990). Submitted to the Journal of Statistics Education by Larry Winner.
Title of Data Set: U.S. Senate Votes for Clinton Removal
Name on CD: Impeachment
n: 100
Variables in Data Set: Name of Senator, State of Senator, Vote on Article I, Vote on Article II, Number of votes for guilt, Political party affiliation, Degree of ideological conservatism, Percent of the vote Clinton received in 1996 in the Senator’s state, Year Senator is up for re-election, First-term Senator
Source: http://usatoday.com/news/index/clinton/senvote2.htm, http://www.conservative.org/new_ratings/1997/97senate-preview.htm, http://www.vote-smart.org. Data compiled for the Journal of Statistics Education by Alan Reifman.

Title of Data Set: UK Total Monthly Air Passengers, 1949-1999
Name on CD: Air Passengers
n: 612
Variables in Data Set: Month, Year, Total number of monthly passengers
Source: http://www.bath.ac.uk/~mascc/Grubb.TS

Title of Data Set: Width and Length of Fourth Grade Students’ Feet
Name on CD: Kid’s Feet
n: 39
Variables in Data Set: Birth month, Birth year, Length of longer foot, Width of longer foot, Gender, Foot measured, Left- or right-handedness
Source: Meyer, M. C. (2006). Wider shoes for wider feet? Journal of Statistics Education, 14(1). Data collected by the author in a fourth grade classroom in Ann Arbor, MI.

Title of Data Set: Wind Chill Factor: Windspeed and Temperature
Name on CD: Wind Chill
n: 120
Variables in Data Set: Actual air temperature, Wind speed, Wind chill factor (Variables presented in list and matrix format)
Source: National Weather Service; Museum of Science of Boston. Quoted in Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Title of Data Set: Yearly Employment Rates in the U.S. of 25- to 34-Year Old Males with 9-11 Years of Schooling
Name on CD: Percent Employed
n: 20
Variables in Data Set: Year, Percent of males employed
Source: The Condition of Education (1991). U.S. Department of Education. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.

Title of Data Set: Yield (%) on British short term government securities in successive months from about 1950 to about 1971
Name on CD: Government Securities
n: 240
Variables in Data Set: Year, Yield per month (January-December)
Source: http://www.bath.ac.uk/~mascc/yield.TS
Title of Data Set: Yields from Vineyard Harvest by Row Number and Year of Harvest, 1983-1991
Name on CD: Harvest Yield
n: 468
Variables in Data Set: Harvest year, Row of vines, Yield of grapes
Source: Barnhill family archives, 1976-1991. Quoted in Chatterjee, S., Handcock, M. S., & Simonoff, J. S. (1995). A casebook for a first course in statistics and data analysis. New York: John Wiley.
Sample Monte Carlo Simulation Program
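/* Step 1: enter the target means, standard deviations, sample size, and */
/* lower-triangular correlations as a TYPE=CORR data set.                 */
data corr1(type= corr);
infile cards missover;
input _type_ $ _name_ $ v1-v6;
cards;
mean . 668.4 680.2 663.3 668.6 666.2 672.2
std . 39.1 48.8 39.1 37.9 37.6 48.1
n . 15000 15000 15000 15000 15000 15000
corr v1 1.00
corr v2 .91 1.00
corr v3 .96 .78 1.00
corr v4 .71 .65 .68 1.00
corr v5 .69 .64 .66 .95 1.00
corr v6 .64 .57 .62 .93 .77 1.00
;
run;

/* Step 2: extract six factors; the OUTSTAT= data set t1 stores the factor pattern. */
proc factor data=corr1 nfact=6 outstat=t1 noprint;
var v1-v6;
run;

title "Simulation Data for Classroom Models";

/* Step 3: generate correlated standard normal scores, then rescale them. */
proc iml;
start sim1;
use work.t1;
read all var {v1 v2 v3 v4 v5 v6} into x12;
n=10000;                                      /* number of simulated cases */
x11= {668.4 680.2 663.3 668.6 666.2 672.2};   /* target means */
xx12= {39.1 48.8 39.1 37.9 37.6 48.1};        /* target standard deviations */
g11= x12[13:18,]`;                            /* rows 13-18 of t1 are taken to be the factor pattern; transposed so rows are variables */
a1= rannor(j(n, 6, 1));                       /* n x 6 independent standard normal deviates (seed 1) */
a1_t= t(a1);
s_hat= g11*a1_t;                              /* 6 x n matrix of correlated standard scores */
stand= t(s_hat);                              /* back to n x 6 */
m1= x11[1,1]; m2= x11[1,2]; m3= x11[1,3]; m4= x11[1,4]; m5= x11[1,5]; m6= x11[1,6];
s1= xx12[1,1]; s2= xx12[1,2]; s3= xx12[1,3]; s4= xx12[1,4]; s5= xx12[1,5]; s6= xx12[1,6];
/* Rescale each standardized column to its target mean and standard deviation */
col_g1= m1 + s1*stand[,1]; col_g2= m2 + s2*stand[,2]; col_g3= m3 + s3*stand[,3];
col_g4= m4 + s4*stand[,4]; col_g5= m5 + s5*stand[,5]; col_g6= m6 + s6*stand[,6];
n_data= col_g1||col_g2||col_g3||col_g4||col_g5||col_g6;
create sim1_data from n_data[colname= {x1 x2 x3 x4 x5 x6}];
append from n_data;
finish sim1;
run sim1;
quit;

/* Step 4: round the simulated scores to whole numbers */
data sample;
set sim1_data;
x1= round(x1, 1); x2= round(x2, 1); x3= round(x3, 1);
x4= round(x4, 1); x5= round(x5, 1); x6= round(x6, 1);
run;

/* Step 5: check the simulated correlations against the targets */
proc corr data= sample;
run;

/* Step 6: draw a simple random sample of 200 cases for classroom use */
proc surveyselect data=sample sampsize= 200 out= temp1;
run;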
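
The program above builds correlated scores from a factor pattern matrix. A commonly used alternative with the same aim is to multiply independent normal deviates by a Cholesky factor of the target correlation matrix. The following is a minimal sketch of that alternative, not part of the program on the CD; for brevity it uses only the first three variables (v1-v3) and their means, standard deviations, and correlations from the program above. The seed and the names r, u, z, x, check, and chol_sim are assumptions made for illustration.

```sas
/* Sketch only: Cholesky-based generation of correlated scores for v1-v3.   */
/* The seed and all matrix and data set names are illustrative assumptions. */
proc iml;
  call randseed(2006);                        /* arbitrary seed for reproducibility */
  n  = 10000;                                 /* same number of cases as the program above */
  mu = {668.4 680.2 663.3};                   /* means of v1-v3 */
  sd = {39.1 48.8 39.1};                      /* standard deviations of v1-v3 */
  r  = {1.00 0.91 0.96,
        0.91 1.00 0.78,
        0.96 0.78 1.00};                      /* correlations among v1-v3 */
  u  = root(r);                               /* upper triangular U with U`*U = r */
  z  = j(n, 3, .);
  call randgen(z, "Normal");                  /* independent standard normal deviates */
  x  = repeat(mu, n, 1) + (z*u) # repeat(sd, n, 1);   /* correlate, then rescale */
  check = corr(x);                            /* sample correlations, to compare with r */
  print check[format=6.3];
  create chol_sim from x[colname={"x1" "x2" "x3"}];
  append from x;
  close chol_sim;
quit;
```

Because the rows of z are independent with identity covariance, the rows of z*u have covariance U`*U = r, so the printed sample correlations should be close to the targets for a sample of this size.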
Appendix A
Example Univariate Output for Arkansas Math.xls
The UNIVARIATE Procedure
Variable: s3MtScSc (Mathematics Scaled Score)

Moments
N                        216    Sum Weights               216
Mean              223.013889    Sum Observations        48171
Std Deviation      88.260106    Variance           7789.84632
Skewness          -0.3278101    Kurtosis           -0.1406821
Uncorrected SS      12417619    Corrected SS       1674816.96
Coeff Variation    39.576058    Std Error Mean     6.00533957

Basic Statistical Measures
Location               Variability
Mean      223.0139     Std Deviation           88.26011
Median    226.0000     Variance                    7790
Mode      375.0000     Range                  375.00000
                       Interquartile Range    115.50000

Tests for Location: Mu0=0
Test            -Statistic-       -----p Value------
Student's t     t    37.13593     Pr > |t|     <.0001
Sign            M         107     Pr >= |M|    <.0001
Signed Rank     S     11502.5     Pr >= |S|    <.0001

Quantiles (Definition 5)
Quantile        Estimate
100% Max           375.0
99%                375.0
95%                375.0
90%                341.0
75% Q3             286.0
50% Median         226.0
25% Q1             170.5
10%                115.0
5%                  51.0
1%                   9.0
0% Min               0.0
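
Several entries in the Moments and Tests for Location tables above follow directly from N, the mean, and the standard deviation. The following is a minimal sketch that recomputes them in a DATA step so students can trace the arithmetic; the data set and variable names are assumptions made for illustration, and the three input values are copied from the output above.

```sas
/* Sketch only: recompute derived statistics from the Moments table.        */
/* Data set and variable names are illustrative assumptions.                */
data moments_check;
  n_obs = 216;                                 /* N */
  xbar  = 223.013889;                          /* Mean */
  sd    = 88.260106;                           /* Std Deviation */
  std_err   = sd / sqrt(n_obs);                /* Std Error Mean */
  t_stat    = xbar / std_err;                  /* Student's t for H0: Mu0 = 0 */
  coeff_var = 100 * sd / xbar;                 /* Coeff Variation, in percent */
  p_value   = 2 * (1 - probt(abs(t_stat), n_obs - 1));   /* two-sided p value */
run;

proc print data=moments_check noobs;
run;
```

Up to rounding, these formulas give the Std Error Mean (about 6.005), Student's t (about 37.14), and Coeff Variation (about 39.58) shown in the listing.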
Extreme Observations
----Lowest----      ----Highest---
Value     Obs       Value     Obs
    0     159         375     168
    0     113         375     171
    9     146         375     183
   10     133         375     198
   14     134         375     214

Stem Leaf                        #   Boxplot
  36 47255555555555555          17      |
  34 12671                       5      |
  32 033684                      6      |
  30 12477789267788999          17      |
  28 0035772345899              13   +-----+
  26 134447778899334578         18   |     |
  24 012345612223359            15   |     |
  22 00225669999123445677889    23   *--+--*
  20 011268012334577889         18   |     |
  18 0001122557788901333489     22   |     |
  16 01123678012457899          17   +-----+
  14 0245799004569              13      |
  12 02702456                    8      |
  10 3577                        4      |
   8 10347                       5      |
   6 614                         3      |
   4 718                         3      |
   2 364                         3      |
   0 009045                      6      |
     ----+----+----+----+--
     Multiply Stem.Leaf by 10**+1
Normal Probability Plot
[Normal probability plot of s3MtScSc (Mathematics Scaled Score): vertical axis from 10 to 370, horizontal axis from -2 to +2]
Example Correlation from Literacy Test.xls
The CORR Procedure
3 Variables: Scaled_Score  Total_Open_Ended  Total_Multiple_Choice

Simple Statistics
Variable                    N        Mean     Std Dev       Sum     Minimum      Maximum   Label
Scaled_Score             4999   212.60152    30.56291   1062795    14.00000    379.00000   Scaled Score
Total_Open_Ended         4999    28.45009    13.00161    142222           0     48.00000   Total_Open_Ended
Total_Multiple_Choice    4999    30.25805    10.41199    151260           0     48.00000   Total_Multiple_Choice

Pearson Correlation Coefficients, N = 4999
Prob > |r| under H0: Rho=0
                           Scaled_Score   Total_Open_Ended   Total_Multiple_Choice
Scaled_Score                    1.00000            0.83332                 0.80558
  Scaled Score                                      <.0001                  <.0001
Total_Open_Ended                0.83332            1.00000                 0.71297
  Total_Open_Ended               <.0001                                     <.0001
Total_Multiple_Choice           0.80558            0.71297                 1.00000
  Total_Multiple_Choice          <.0001             <.0001
Example ANOVA Output for Literacy Test.xls
The GLM Procedure

Class Level Information
Class    Levels    Values
Race          5    African-American Asian/Pacific Islander Hispanic Other White

Number of Observations Read    4999
Number of Observations Used    4999

Dependent Variable: Scaled_Score   Scaled Score
Source               DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                 4        314682.210      78670.553      90.24    <.0001
Error              4994       4353906.018        871.827
Corrected Total    4998       4668588.228

R-Square    Coeff Var    Root MSE    Scaled_Score Mean
0.067404     13.88829    29.52672             212.6015

Source    DF      Type I SS    Mean Square    F Value    Pr > F
Race       4    314682.2102     78670.5526      90.24    <.0001

Source    DF    Type III SS    Mean Square    F Value    Pr > F
Race       4    314682.2102     78670.5526      90.24    <.0001
Tukey's Studentized Range (HSD) Test for Scaled_Score
NOTE: This test controls the Type I experimentwise error rate.
Alpha                                      0.05
Error Degrees of Freedom                   4994
Error Mean Square                      871.8274
Critical Value of Studentized Range     3.85915
Comparisons significant at the 0.05 level are indicated by ***.

                                              Difference      Simultaneous 95%
Race Comparison                              Between Means    Confidence Limits
Asian/Pacific Islander - White                     3.7908     -7.8015    15.3831
Asian/Pacific Islander - Other                     6.0878     -8.1354    20.3110
Asian/Pacific Islander - Hispanic                 18.1487      5.5481    30.7493   ***
Asian/Pacific Islander - African-American         21.9304     10.1821    33.6786   ***
White - Asian/Pacific Islander                    -3.7908    -15.3831     7.8015
White - Other                                      2.2970     -6.1704    10.7644
White - Hispanic                                  14.3579      9.0501    19.6658   ***
White - African-American                          18.1396     15.4157    20.8634   ***
Other - Asian/Pacific Islander                    -6.0878    -20.3110     8.1354
Other - White                                     -2.2970    -10.7644     6.1704
Other - Hispanic                                  12.0609      2.2583    21.8636   ***
Other - African-American                          15.8426      7.1629    24.5223   ***
Hispanic - Asian/Pacific Islander                -18.1487    -30.7493    -5.5481   ***
Hispanic - White                                 -14.3579    -19.6658    -9.0501   ***
Hispanic - Other                                 -12.0609    -21.8636    -2.2583   ***
Hispanic - African-American                        3.7817     -1.8587     9.4220
African-American - Asian/Pacific Islander        -21.9304    -33.6786   -10.1821   ***
African-American - White                         -18.1396    -20.8634   -15.4157   ***
African-American - Other                         -15.8426    -24.5223    -7.1629   ***
African-American - Hispanic                       -3.7817     -9.4220     1.8587

Level of                            ---------Scaled_Score--------
Race                          N           Mean         Std Dev
African-American           1174     199.436968      26.8714035
Asian/Pacific Islander       49     221.367347      31.5401159
Hispanic                    247     203.218623      29.1798596
Other                        93     215.279570      31.5365094
White                      3436     217.576542      30.3219353
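
The summary statistics in the GLM output above can be traced back to the sums of squares, and the Tukey limits to the error mean square and group sizes. The following is a minimal sketch of that arithmetic in a DATA step; the data set and variable names are assumptions made for illustration, and the input values are copied from the output. The confidence limit formula is the Tukey-Kramer form for unequal group sizes, applied here to the White versus African-American comparison.

```sas
/* Sketch only: recompute ANOVA summary values and one Tukey-Kramer interval. */
/* Data set and variable names are illustrative assumptions.                  */
data anova_check;
  ss_model = 314682.210;   df_model = 4;
  ss_error = 4353906.018;  df_error = 4994;
  ms_model = ss_model / df_model;               /* Mean Square for Model */
  ms_error = ss_error / df_error;               /* Mean Square for Error */
  f_value  = ms_model / ms_error;               /* F Value */
  p_value  = 1 - probf(f_value, df_model, df_error);
  r_square = ss_model / (ss_model + ss_error);  /* R-Square */
  root_mse = sqrt(ms_error);                    /* Root MSE */
  coeff_var = 100 * root_mse / 212.6015;        /* Coeff Var, using the Scaled_Score mean */

  /* Tukey-Kramer limits for White minus African-American */
  q_crit = 3.85915;                             /* Critical Value of Studentized Range */
  n_white = 3436;  n_aa = 1174;                 /* group sizes from the means table */
  diff = 217.576542 - 199.436968;               /* difference between group means */
  half_width = q_crit * sqrt((ms_error / 2) * (1/n_white + 1/n_aa));
  lower = diff - half_width;
  upper = diff + half_width;
run;

proc print data=anova_check noobs;
run;
```

Up to rounding, the results match the listing: an F value of about 90.24, R-Square of about 0.0674, Root MSE of about 29.53, and limits of roughly 15.42 to 20.86 for the White minus African-American difference.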