
Getting Started with Large Scale Datasets

Dr. Joni M. Lakin

Dr. Margaret Ross

Dr. Yi Han

Presentation Files Are Available:

http://www.auburn.edu/~jml0035/

(Under “Conference materials and resources” at the bottom of the page)

Opening questions

• How many of you primarily use SPSS for data analysis?

• How many are comfortable with using syntax (in SPSS or other programs)?

• How many already have plans to use a specific dataset?

• How many are just curious about what’s available?

What Data is Available?

Dr. Yi Han

U.S. National Datasets

• NCES

U.S. National Datasets

• Restricted use licenses http://nces.ed.gov/nationsreportcard/researchcenter/license.aspx

International Datasets

• PISA

• PIAAC

Accessing Data and Getting Started

Dr. Margaret Ross

See PDFs

Key Issues in Working with Large Datasets

Dr. Joni Lakin

Key issues

1. Statistical weighting in SPSS

2. Practical significance and large samples

3. Matrix sampling

4. Plausible values

SPSS skills that make working with large datasets easier:

5. Keeping and managing syntax

6. Merging datasets

7. Checking for duplicate cases

8. Missing data imputation

1. Statistical weighting in SPSS

• Weights allow us to better approximate the full population

• If African American students are 18% of the population but only 9% of my sample, I could weight each African American student by 2.0 (18% / 9%), so each observation counts twice in analyses, to get results that better reflect population-level effects.

• Types of weights

• Scale weights = multiply observations so the weighted sample is the same size as the population

• Proportional weights = rescaled (values may be below 1) so the weighted sample size stays the same as the actual sample size

• Note

• When reporting results, you can report the weighted sample size, but you should also report the unweighted sample size
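The weighting arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not SPSS syntax: the 100-case sample, the "Other" group, and the population size of 1,000 are hypothetical values chosen so the numbers match the 18% / 9% example.

```python
# Hypothetical shares matching the example: 18% of population, 9% of sample.
population_share = {"African American": 0.18, "Other": 0.82}
sample_share = {"African American": 0.09, "Other": 0.91}

# Toy sample of 100 cases: 9 African American students and 91 others.
sample = ["African American"] * 9 + ["Other"] * 91
n = len(sample)
population_size = 1000  # hypothetical population size

# Raw weight for each case = population share / sample share of its group.
raw = [population_share[g] / sample_share[g] for g in sample]

# Proportional weights: rescaled so they sum to the actual sample size.
proportional = [w * n / sum(raw) for w in raw]

# Scale weights: rescaled so they sum to the population size.
scale_weights = [w * population_size / n for w in proportional]

# Weighted share of African American students in the sample.
weighted_share = sum(
    w for g, w in zip(sample, proportional) if g == "African American"
) / sum(proportional)

print(f"weight per African American case: {raw[0]:.2f}")
print(f"weighted share: {weighted_share:.2f}")
print(f"sum of proportional weights: {sum(proportional):.0f}")
print(f"sum of scale weights: {sum(scale_weights):.0f}")
```

Both weight types give the same percentages; they differ only in the weighted N that the software reports.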

Using weights

These “weight” values are already in large datasets

ELS:2002 Race (UNWEIGHTED)

Category                          Freq.      %
Amer. Indian/Alaska Native          130    0.8
Asian, Hawaii/Pac. Islander        1460    9.0
Black or African American          2020   12.5
Hispanic, no race specified         996    6.1
Hispanic, race specified           1221    7.5
More than one race                  735    4.5
White, non-Hispanic                8682   53.6
Total                             16197  100.0

Pie chart (unweighted): Amer. Indian/Alaska Native 1%; Asian, Hawaii/Pac. Islander 10%; Black or African American 13%; Hispanic, no race specified 6%; Hispanic, race specified 8%; More than one race 5%; White, non-Hispanic 57%

ELS:2002 Race (WEIGHTED)

Category                          Freq.      %
Amer. Indian/Alaska Native        32781    1.0
Asian, Hawaii/Pac. Islander      142518    4.2
Black or African American        491321   14.4
Hispanic, no race specified      243607    7.1
Hispanic, race specified         298648    8.8
More than one race               147896    4.3
White, non-Hispanic             2054103   60.2
Total                           3410873  100.0

Pie chart (weighted): Amer. Indian/Alaska Native 1%; Asian, Hawaii/Pac. Islander 4%; Black or African American 15%; Hispanic, no race specified 7%; Hispanic, race specified 9%; More than one race 4%; White, non-Hispanic 60%
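A rough sketch of how weighted and unweighted frequency tables differ is below. The five records and their weights are invented for illustration and are not ELS:2002 values; the function name is also hypothetical.

```python
from collections import defaultdict

# Hypothetical case-level records: (race category, sampling weight).
cases = [
    ("White, non-Hispanic", 250.0),
    ("White, non-Hispanic", 230.0),
    ("Black or African American", 240.0),
    ("Asian, Hawaii/Pac. Islander", 90.0),
    ("Asian, Hawaii/Pac. Islander", 110.0),
]

def frequency_table(records, weighted):
    """Return {category: (frequency, percent)}, counting each case as 1 or as its weight."""
    totals = defaultdict(float)
    for category, weight in records:
        totals[category] += weight if weighted else 1.0
    grand_total = sum(totals.values())
    return {c: (n, 100.0 * n / grand_total) for c, n in totals.items()}

unweighted = frequency_table(cases, weighted=False)
weighted = frequency_table(cases, weighted=True)

for category in unweighted:
    print(category, unweighted[category], weighted[category])
```

In this toy data the oversampled group makes up 40% of the cases but a much smaller share once the weights are applied, which is the same pattern the two ELS:2002 tables show for the Asian, Hawaii/Pac. Islander category.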

2. Practical significance and large datasets

• Because of the large sample size, many negligible effects (and virtually all correlations) will be statistically significant

• Must consider effect sizes and practical significance

Independent Samples Test

ELS:2002 variables                 t      df    Sig.
Math test score                 8.71    8593   <.001
Reading test score             -4.14    8593   <.001
Mathematics self-efficacy      14.65    8593   <.001
English self-efficacy scale    -2.19    8593    .029

Wow!! All significant!!

Practical significance and large datasets

• Actually negligible differences for reading and small differences for math

Independent Samples Test

ELS:2002 variables                 t      df    Sig.   Cohen’s d
Math test score                 8.71    8593   <.001        0.19
Reading test score             -4.14    8593   <.001       -0.09
Mathematics self-efficacy      14.65    8593   <.001        0.32
English self-efficacy scale    -2.19    8593    .029       -0.05
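One common way to get an effect size from this kind of output is the approximation d ≈ 2t / √df. The sketch below assumes the two groups are roughly equal in size; the slides do not report group sizes, so that is an assumption, although the approximation does reproduce the Cohen’s d column above.

```python
import math

def cohens_d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t (assumes equal group sizes)."""
    return 2.0 * t / math.sqrt(df)

# t and df values taken from the Independent Samples Test table.
results = {
    "Math test score": (8.71, 8593),
    "Reading test score": (-4.14, 8593),
    "Mathematics self-efficacy": (14.65, 8593),
    "English self-efficacy scale": (-2.19, 8593),
}

effect_sizes = {
    name: round(cohens_d_from_t(t, df), 2) for name, (t, df) in results.items()
}

for name, d in effect_sizes.items():
    print(f"{name}: d = {d:+.2f}")
```

With df this large, even a t of -2.19 (p = .029) corresponds to a d of about -0.05, which is why significance alone says little about practical importance.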

3. Matrix sampling (be aware of…)

• Used in large-scale assessments when

  • Large domain being sampled (e.g., world history)

  • Need to cover many topics in limited time

  • Individual estimates of the constructs are less important than aggregate estimates (state level achievement)

• Usually requires IRT (item response theory) scoring methods to allow for comparable scores across examinees completing different items

Table from von Davier et al., http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf
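To make the idea concrete, here is a toy sketch of a matrix-sampling design: the item pool is split into blocks, blocks are paired into booklets, and booklets are rotated across students. The pool size, block size, and student IDs are all hypothetical and do not describe any particular assessment.

```python
from itertools import combinations

# Hypothetical item pool: 12 items split into 4 blocks of 3 items each.
items = [f"item{i:02d}" for i in range(1, 13)]
blocks = [items[i:i + 3] for i in range(0, len(items), 3)]

# Each booklet pairs two blocks, so every pair of blocks appears together once.
booklets = [blocks[a] + blocks[b] for a, b in combinations(range(len(blocks)), 2)]

# Rotate booklets across (hypothetical) students so each booklet is used equally often.
students = [f"S{n:03d}" for n in range(1, 13)]
assignment = {s: booklets[i % len(booklets)] for i, s in enumerate(students)}

# Every item is administered to someone, but no student sees the whole pool.
items_seen = {item for booklet in assignment.values() for item in booklet}

print(f"{len(booklets)} booklets of {len(booklets[0])} items each")
print(f"items covered across students: {len(items_seen)} of {len(items)}")
```

Because students answer different items, their raw scores are not directly comparable, which is where IRT scoring comes in.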

4. Plausible values

• Can result from matrix sampling (with IRT models), bootstrapping, and missing data imputation

• In matrix sampling, individual estimates of skills are less reliable and plausible values better capture this error variance compared to single scores

• Results in multiple estimates of the student’s true score on the construct (will appear as multiple variables)

• Poor practice = averaging plausible values before analysis

• Produces biased estimates (von Davier et al., see notes)

• Better practice = using methods that analyze the different estimates together and produce standard error bars

• Refer to von Davier et al. link in notes
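A minimal sketch of the "analyze together" approach is below, using the standard multiple-imputation combining rules (often called Rubin's rules): run the analysis once per plausible value, average the estimates, and add the between-value variance to the sampling variance. The five estimates and their sampling variances are hypothetical numbers, and the function name is illustrative.

```python
import statistics

def combine_plausible_values(estimates, sampling_variances):
    """Pool one estimate per plausible value into a single estimate and standard error."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(sampling_variances)   # average sampling variance
    between = statistics.variance(estimates)        # variance across plausible values
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5

# Hypothetical: a group mean estimated separately from each of five plausible values.
pv_means = [50.2, 49.8, 50.5, 50.1, 49.9]
pv_sampling_var = [0.040, 0.042, 0.039, 0.041, 0.040]

estimate, standard_error = combine_plausible_values(pv_means, pv_sampling_var)
naive_se = statistics.fmean(pv_sampling_var) ** 0.5

print(f"pooled estimate: {estimate:.2f}")
print(f"pooled SE: {standard_error:.3f} (ignoring between-value variance: {naive_se:.3f})")
```

The pooled standard error is larger than the one from any single plausible value because it includes the measurement uncertainty that averaging the values beforehand would hide.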

5. Keeping and managing syntax

• From any dialog window, you can select “Paste” to save the command as syntax

• Makes sure analyses start with the same data selections: sample weights, split files, selecting relevant cases

• Good for keeping record of computed and recoded variables

6. Merging datasets

• Add cases = add more participants’ data

• Add variables = add variables for same participants from another dataset

Merging datasets: Adding variables

• Have to exclude duplicate variables from one dataset

• Check that values are really identical (if not, change variable name)

• Use Key Variables to match cases
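The logic of an "add variables" merge can be sketched in plain Python as follows. The two small files, the key variable name stu_id, and the "_b" suffix for conflicting variables are all hypothetical; this illustrates the checks listed above rather than what SPSS does internally.

```python
# Hypothetical files that share a key variable (stu_id).
file_a = [
    {"stu_id": 101, "math": 52.1, "sex": "F"},
    {"stu_id": 102, "math": 47.3, "sex": "M"},
    {"stu_id": 103, "math": 60.8, "sex": "F"},
]
file_b = [
    {"stu_id": 101, "reading": 55.0, "sex": "F"},
    {"stu_id": 103, "reading": 58.4, "sex": "M"},  # "sex" disagrees with file_a
]

def add_variables(primary, secondary, key):
    """Match cases on the key variable and add the secondary file's variables."""
    lookup = {row[key]: row for row in secondary}
    merged, conflicts = [], []
    for row in primary:
        combined = dict(row)
        for name, value in lookup.get(row[key], {}).items():
            if name in combined and combined[name] != value:
                # Duplicate variable whose values differ: keep both under a new name.
                conflicts.append((row[key], name))
                combined[name + "_b"] = value
            else:
                # New variable, or a duplicate with an identical value: keep one copy.
                combined.setdefault(name, value)
        merged.append(combined)
    return merged, conflicts

merged, conflicts = add_variables(file_a, file_b, key="stu_id")

for row in merged:
    print(row)
print("conflicting duplicate variables:", conflicts)
```

Student 102 has no match in the second file and simply keeps the original variables, while the conflicting value for student 103 is flagged for review instead of being silently overwritten.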

7. Checking for duplicate cases

Duplicate cases output

• Will appear as a new variable “PrimaryLast”

• Will need to decide how to handle on case-by-case basis

• Merging datasets incorrectly can result in duplicates

• If variables are identical, delete one

• If variables are different, check that identification variables are correct
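As a rough illustration of the duplicate check, the sketch below flags the last case in each group of matching IDs as primary (1) and earlier matches as duplicates (0), then sorts duplicate groups into the two situations described above. The records and function names are hypothetical, and the flag only mimics the meaning of PrimaryLast; it is not SPSS output.

```python
# Hypothetical records with a repeated ID variable.
records = [
    {"stu_id": 101, "math": 52.1},
    {"stu_id": 102, "math": 47.3},
    {"stu_id": 101, "math": 52.1},  # exact duplicate of the first row
    {"stu_id": 103, "math": 60.8},
    {"stu_id": 102, "math": 49.0},  # same ID, different score
]

def flag_primary_last(rows, key):
    """Return 1 for the last case in each matching group, 0 for earlier duplicates."""
    last_index = {}
    for i, row in enumerate(rows):
        last_index[row[key]] = i
    return [1 if last_index[row[key]] == i else 0 for i, row in enumerate(rows)]

def review_duplicates(rows, key):
    """Classify each duplicated ID as fully identical or differing on other variables."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    report = {}
    for value, members in groups.items():
        if len(members) > 1:
            identical = all(m == members[0] for m in members)
            report[value] = (
                "identical: delete one" if identical else "different: check ID variables"
            )
    return report

primary_last = flag_primary_last(records, "stu_id")
report = review_duplicates(records, "stu_id")

print(primary_last)
print(report)
```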

8. Missing data

• Methods that bias results:

• Mean substitution, listwise or pairwise deletion

• Methods that can provide less biased estimates

• Single imputation regression (better than above, but restricts variability)

• Expectation-maximization (EM): best of the SPSS options; works well when data are missing at random

• Analyze → Missing Value Analysis

• Be sure to read up on “missing completely at random”, “missing at random”, and “missing not at random”
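One way to see why mean substitution distorts results is to compare the spread of a variable before and after filling in the mean. The scores below are hypothetical; the sketch only shows the shrinkage in variability and does not implement regression imputation or EM.

```python
import statistics

# Hypothetical test scores with three missing values (None).
scores = [42.0, 55.0, None, 61.0, None, 48.0, 70.0, None, 53.0]

observed = [x for x in scores if x is not None]
mean_observed = statistics.fmean(observed)

# Mean substitution: every missing value is replaced by the observed mean.
filled = [mean_observed if x is None else x for x in scores]

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)

print(f"mean (observed cases): {mean_observed:.2f}")
print(f"mean (after substitution): {statistics.fmean(filled):.2f}")
print(f"SD (observed cases): {sd_observed:.2f}")
print(f"SD (after substitution): {sd_filled:.2f}")
```

The mean is unchanged, but the standard deviation shrinks because the filled-in cases sit exactly at the mean, which in turn understates standard errors and weakens correlations with other variables.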

Other Resources

Dr. Lakin

AERA Research Grants and Dissertation Grants

“The program seeks to stimulate research on U.S. education issues using data from the large-scale, national and international data sets supported by the National Center for Education Statistics (NCES), NSF, and other federal agencies, and to increase the number of education researchers using these data sets.”

Suggestions based on personal observations and the RFP:

• Must use a strong quasi-experimental design (Schneider et al., Estimating Causal Effects: Using Experimental and Observational Designs)

• Regression discontinuity, propensity score matching, etc.

• Bringing in new quantitative approaches from other fields is also very appealing (economics, epidemiology, etc.)

• Check past grants to see which datasets are “neglected” (more recent datasets are better)

• Prefer ideas that involve multiple datasets in meaningful research

• Analyses of recent international datasets have been more successful

Other opportunities

• IES Research Grants do fund secondary data analyses with Exploration grant goals (any subject area) http://ies.ed.gov/funding/

• IES data training workshops http://ies.ed.gov/whatsnew/conferences/?cid=2

• AERA annual meeting usually has data training events:

• PDC02: Analyzing NAEP Assessment Data with Plausible Values…

• PDC13: Advanced Analysis using Adult International Large Scale Assessment Databases

• PDC16: Using NAEP Data on the Web for Educational Policy Research

• Several on quantitative methods (including propensity scores)

• AERA Institute on Statistical Analysis for Education Policy (summer)

• IES/NCES hosts STATS-DC conferences and summer institutes to train researchers in using specific datasets

Q&A

Presentation files are available from http://www.auburn.edu/~jml0035/

(Under “Conference materials and resources”)
