Statistical Data Analysis


Chapter 9 - Montello and Sutton, An Introduction to Scientific Research Methods in Geography

Overview

 Statistical data analysis

 Statistical description

 Statistical inference

 Geospatial Analysis

Data Analysis

 Set of display and mathematical techniques

 Logical and conceptual considerations

 Allows us to:

 Extract meaning from systematically collected measurements

 Communicate that meaning to others

Geographers and Data

 Geographers view data as statistical (complex and imperfect) rather than deterministic

 Three reasons:

 Imperfect sample of larger population

 Measurement involves error

 Phenomena are expressions of complex sets of many interacting variables

Statistical Description

 Goal: summarize potentially important properties of our data using summary indices

 Statistics describe the sample; parameters describe the population

 Properties:

 Central tendency

 Variability / dispersion

 Form / shape of distribution

 Relationships

Central Tendency

 Average or representative value

 Three most common:

 Mode - most frequent

 Median - middle value

 Mean (“average”)
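The three measures can be compared on one small data set. The sketch below uses Python's standard `statistics` module; the `scores` list is a hypothetical set of measurements, not data from the chapter.

```python
from statistics import mean, median, multimode

# Hypothetical measurements; the single large value (14) acts as an outlier.
scores = [2, 3, 3, 5, 7, 8, 14]

central = {
    "mode": multimode(scores),   # list of most frequent value(s)
    "median": median(scores),    # middle value of the sorted data
    "mean": mean(scores),        # arithmetic average
}
print(central)
```

In this example the outlier pulls the mean above the median, which is one reason the median is often preferred for skewed data.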

Variability / Dispersion

 Tells how data points differ from the central tendency

 How representative the central tendency is

 Representativeness is greater when variability is low

 Three common:

 Range - distance between high and low

 Variance - average of squared deviations from the mean

 Standard deviation - square root of the variance
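A minimal sketch of the three dispersion measures, again with a hypothetical `scores` list. It treats the data as a whole population (dividing by n); for a sample, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
from statistics import mean, pvariance, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

data_range = max(scores) - min(scores)

# Variance written out by hand: average of squared deviations from the mean.
m = mean(scores)
variance_by_hand = sum((x - m) ** 2 for x in scores) / len(scores)

variance = pvariance(scores)   # same quantity from the standard library
std_dev = pstdev(scores)       # square root of the variance

print(data_range, variance, std_dev)
```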

Form / Distribution I

 Shape of entire data set

 Modality - number of local modes

 Skewness - asymmetry of the distribution

 Positive - mostly low and medium scores, with a tail toward high values

 Negative - mostly medium and high scores, with a tail toward low values

 Symmetry - mirror around central tendency

 Bimodal

 Unimodal normal or “bell-shaped” curve

Form / Distribution II

 Derived scores

 Describe the value of individual scores relative to the rest of the data set

 Three common:

 Rank - 1, 2, 3, etc.

 Percentile rank - percentage of the data that is less than the score in question

 z-score - standard deviation units above or below the mean of the data set
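The three derived scores can be sketched as small functions. The function names and the `scores` data are illustrative assumptions; the rank convention used here (1 = highest value, ties share the better rank) is only one of several in common use.

```python
from statistics import mean, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

def rank_of(value, data):
    """Rank with 1 = highest value; tied values share the better rank."""
    return 1 + sum(1 for x in data if x > value)

def percentile_rank(value, data):
    """Percentage of the data strictly less than the given value."""
    return 100.0 * sum(1 for x in data if x < value) / len(data)

def z_score(value, data):
    """Distance from the mean in standard deviation units."""
    return (value - mean(data)) / pstdev(data)

print(rank_of(7, scores), percentile_rank(7, scores), z_score(7, scores))
```

For the score 7 in this data set (mean 5, standard deviation 2), the z-score is 1: one standard deviation above the mean.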

Relationships I

 Systematic (consistent) patterns of high or low values across pairs of variables

 Linear relationship - two variables form a straight line when graphed

 Positive (or direct) - high value A has high value B; low value A has low value B

 Negative (or indirect) - high value A has low value B; low value A has high value B

Relationships II

 Relationship strength - degree that patterns hold across all cases

 Correlation coefficient - measure of relationship strength and direction, ranging from -1 to +1; its square gives the proportion of variance the two variables share

 Regression analysis - expresses relationship as an equation that predicts the values of Y (criterion variable) as a function of X (predictor variable)

 Monotonic relationship - goes up or down; not necessarily in a straight line
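Relationship strength and the regression equation can be worked through by hand on a tiny example. The sketch below assumes Pearson's correlation and simple least-squares regression of Y on X; the function names and the data are hypothetical.

```python
from math import sqrt

def _sums(x, y):
    """Return means of x and y plus the sums of cross-products and squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return mx, my, sxy, sxx, syy

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    _, _, sxy, sxx, syy = _sums(x, y)
    return sxy / sqrt(sxx * syy)

def least_squares(x, y):
    """Slope and intercept of the line predicting y (criterion) from x (predictor)."""
    mx, my, sxy, sxx, _ = _sums(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx

x = [1, 2, 3, 4, 5]  # hypothetical predictor values
y = [2, 4, 5, 4, 5]  # hypothetical criterion values

r = pearson_r(x, y)
slope, intercept = least_squares(x, y)
print(round(r, 3), round(r ** 2, 3), slope, intercept)
```

Here the fitted line is Y = 2.2 + 0.6 X, and the squared correlation of 0.6 is the share of the variance in Y accounted for by the linear relationship.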

Statistical Inference I

 Goal: Draw informed guesses about likely patterns in population, based on sample data evidence

 Assign probabilities to guesses

 Sampling distribution - distribution of a sample statistic based on all possible samples of a given size, from a given population
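For a very small population, the sampling distribution of the mean can be listed in full. This sketch enumerates every sample of size 2 (drawn without replacement) from a hypothetical five-value population.

```python
from itertools import combinations
from statistics import mean

population = [2, 4, 6, 8, 10]  # hypothetical population
n = 2                          # sample size

# One mean per possible sample: together these form the sampling distribution.
sample_means = [mean(s) for s in combinations(population, n)]

print(sorted(sample_means))
print(mean(sample_means), mean(population))
```

The sample means vary from sample to sample, but their average equals the population mean, which is what makes the sample mean a reasonable guess about the parameter.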

Statistical Inference II

 Assumptions:

 Distribution is normal and variances are equal

 Data values are independent

 Model specification (such as linearity, inclusive of relevant predictor constructs)

Statistical Inference III

 Two approaches:

 Estimation

 Point estimate - guess about specific parameter value

 Confidence interval - range of values around the point estimate, with a stated probability (confidence level) attached

 Hypothesis Testing

 Null hypothesis (H0) is a claim about the exact value of a parameter

 Alternative hypothesis (HA) is that the parameter's value is not the one stated in the null
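Both approaches can be sketched for a population mean. The code below is a simplified illustration that assumes a normal approximation for the sampling distribution of the mean; with a sample this small, a t distribution would normally be used instead. The function names and the `sample` values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Interval around the sample mean (normal approximation, an assumption)."""
    m = mean(sample)                           # point estimate
    se = stdev(sample) / sqrt(len(sample))     # estimated standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * se, m + z * se

def two_sided_p_value(sample, null_mean):
    """p-value for H0: population mean == null_mean, against HA: it is not."""
    se = stdev(sample) / sqrt(len(sample))
    z = (mean(sample) - null_mean) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

sample = [12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 12.2, 12.5, 11.7]

low, high = mean_confidence_interval(sample)
print(low, high)
print(two_sided_p_value(sample, 13.0))
```

A null value that falls outside the 95% interval gives a two-sided p-value below 0.05 under the same approximation, which is how the two approaches connect.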

Statistical Inference IV

 Four possible outcomes, based on:

 Two possible truths (H0 is true; H0 is false) and

 Two possible decisions (reject H0 and accept HA; fail to reject H0)

 Two types of errors:

 Type I - reject H0 when H0 is true

 Type II - fail to reject H0 when H0 is false
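The four outcomes can be counted in a small simulation. This is a sketch under stated assumptions: H0 says the population mean is 0, the population standard deviation is known to be 1, and a two-sided z test is used at alpha = 0.05. The function names, sample size, and the alternative mean of 0.3 are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean

def classify(h0_true, reject_h0):
    """Name the outcome for one combination of truth and decision."""
    if h0_true:
        return "Type I error" if reject_h0 else "correct: H0 retained"
    return "correct: H0 rejected" if reject_h0 else "Type II error"

def simulate(true_mean, null_mean=0.0, sigma=1.0, n=30,
             trials=4000, alpha=0.05, seed=0):
    """Proportion of each outcome when testing H0: mean == null_mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    counts = {}
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - null_mean) / (sigma / n ** 0.5)
        outcome = classify(true_mean == null_mean, abs(z) > z_crit)
        counts[outcome] = counts.get(outcome, 0) + 1
    return {name: count / trials for name, count in counts.items()}

when_h0_true = simulate(true_mean=0.0)
when_h0_false = simulate(true_mean=0.3)
print(when_h0_true)
print(when_h0_false)
```

When H0 is true, the Type I error rate should sit near alpha. When H0 is false, the Type II error rate depends on how far the true mean is from the null value and on the sample size.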

Geospatial Analysis

 Geographic data are different:

 They are spatially distributed

 Have location, extent or size, shape, pattern, connectivity, etc.

 They represent natural and human earth-surface features and processes

 Spatiality is the focus or is central to the analysis

Spatiality

 Influences the accuracy of inferential statistical analyses of nonspatial variables

 Spatial autocorrelation exists when there are patterns of spatial dependence – places are “like” other places

 Distance decay – near things are “more like” each other than things further away
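One widely used index of spatial autocorrelation is Moran's I, which is not named in these slides and is introduced here only as an illustration. The sketch below computes it for values along a row of places, assuming binary weights in which each place neighbors only the places directly beside it; `morans_i` and `chain_weights` are hypothetical helper names.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    w_total = sum(sum(row) for row in weights)
    # Cross-products of deviations, counted only for neighboring pairs.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * (num / den)

def chain_weights(n):
    """Binary weights for n places in a row: 1 if adjacent, else 0."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

w = chain_weights(5)
smooth_gradient = [1, 2, 3, 4, 5]   # neighbors are similar
checkerboard = [1, 5, 1, 5, 1]      # neighbors are dissimilar

print(morans_i(smooth_gradient, w))
print(morans_i(checkerboard, w))
```

The gradient gives a positive value (near places are alike) and the alternating pattern gives a negative value (near places are unlike), the two forms spatial autocorrelation can take.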

Areal Units

 Which areal units to use?

 Problems:

 Using data from a continuous source, but treating them with discrete spatial analysis techniques

 Politicization of unit determination (like gerrymandering)

 Modifiable Areal Unit Problem (MAUP) – effect that theoretically arbitrary areal geometries have on geographic analysis
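The MAUP can be demonstrated with a toy example: the same cell-level data, aggregated into two different sets of zones, yield different correlations. The data, the zonings, and the function names below are hypothetical, and Pearson's correlation is used as the example statistic.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def zone_means(values, zones):
    """Average the cell values inside each zone (a tuple of cell indices)."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]

# Two variables measured on a row of eight cells.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

# Two equally arbitrary ways of grouping the same cells into zones.
zoning_a = [(0, 1), (2, 3), (4, 5), (6, 7)]
zoning_b = [(0,), (1, 2), (3, 4), (5, 6), (7,)]

r_cells = pearson_r(x, y)
r_a = pearson_r(zone_means(x, zoning_a), zone_means(y, zoning_a))
r_b = pearson_r(zone_means(x, zoning_b), zone_means(y, zoning_b))
print(round(r_cells, 3), round(r_a, 3), round(r_b, 3))
```

Nothing about the underlying cells changes between the three results; only the areal units do.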

Questions

 Why is data analysis in geography usually conceptualized in statistical (probabilistic) terms?

 What is meant by strength and form of statistical relationships?

 What is the purpose of statistical inference? Why are statistical inferences necessarily and ultimately uncertain?

 What are two types of correct decisions and two types of errors possible when hypothesis testing?

 What is spatial autocorrelation, what forms can it take, and why is it so important to geographic data analysis?
