Ph.D. Dissertation Defense
Geographic Knowledge Discovery in Spatial
Interaction With Self-Organizing Maps
Jun Yan
Geography Department
SUNY at Buffalo
July 29, 2004
Dissertation Committee:
Dr. Jean-Claude Thill (Chair)
Dr. Ling Bian
Dr. David Mark
Outline

Background

Spatial Interaction Data

Methodology

Self-Organizing Maps

Visual Data Mining

Case studies

Conclusions and Future Research
Background

Data-rich vs computation-rich:

challenge?

opportunity !!!
Information technologies
Two Legs!!!
More data available
More tools available
Background (Cont.)

Data Mining & Knowledge Discovery: “useful information from large databases”

useful
novel
valid
understandable

Geographic data mining (GDM) and geographic knowledge discovery (GKD)?
Background (Cont.)

Mining techniques: statistics, pattern recognition, machine learning, visualization, high performance computing …

Knowledge discovery process
[Figure: knowledge discovery process. Components: User, Controller, DBMS, DB Interface, Domain Knowledge, Knowledge Base, Target Data, Selection, Data Mining, Evaluation, Discoveries]
Background (Cont.)

Finding all the patterns in a database autonomously? Unrealistic,

because the patterns could be too numerous and mostly uninteresting

Data mining: an iterative, interactive, semi-automated process

people direct what is to be mined

Visualization: Geovisualization (GVis)

visual data mining!!!
Visualization in KDD Process
Selecting Application Domain
Selecting Target Data
Processing Data
Extracting Information/Knowledge
Interpretation and Evaluation
Understanding basic data distribution, selecting meaningful target datasets
Locating missing data, removing noise, smoothing data
Parameter setting, process tracking, process steering
Interpretation, reporting, comparison, validity checking
Background (Cont.)

Machine learning & Neural Networks
[Figure: machine learning. Labels: Examples, Background knowledge (sometimes), Learning Algorithm, Concept description or Other knowledge]
[Figure: neural network. Labels: Inputs, Input layer, Hidden layer, Output layer, Outputs]
Background (Cont.)

Objectives:

Explore the effectiveness of neural
networks in GKD

Examine the roles of GVis in GKD
Spatial Interaction Data
 What is spatial interaction?

Pairs of places

Elemental: trips made by individuals

Aggregate: flows from origins to
destinations

Examples: migration, freight shipment,
movement of capital & information …
Spatial Interaction Data (Cont.)
Elemental level: trip table

          Origin    Destination    Distance
Trip 1
Trip 2
Trip 3

Aggregate level: basic O-D matrix

            Region 1    Region 2    Region 3
Region 1
Region 2
Region 3

Aggregate level: dyadic O-D matrix

                     Type 1    Type 2    Type 3
Region 1>Region 1
Region 1>Region 2
Region 1>Region 3
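
The step from the elemental trip table to the two aggregate matrices can be sketched as follows. This is only an illustration under stated assumptions: the trip records, the region names, the interaction types, and the function names `basic_od_matrix` and `dyadic_od_matrix` are all hypothetical, and each trip is assumed to carry an interaction type instead of a distance.

```python
from collections import defaultdict


def basic_od_matrix(trips, regions):
    """Count trips per (origin, destination) pair as a nested list."""
    index = {region: i for i, region in enumerate(regions)}
    matrix = [[0] * len(regions) for _ in regions]
    for origin, destination, _kind in trips:
        matrix[index[origin]][index[destination]] += 1
    return matrix


def dyadic_od_matrix(trips, kinds):
    """One row per origin>destination dyad, one column per interaction type."""
    column = {kind: j for j, kind in enumerate(kinds)}
    rows = defaultdict(lambda: [0] * len(kinds))
    for origin, destination, kind in trips:
        rows[f"{origin}>{destination}"][column[kind]] += 1
    return dict(rows)


# Hypothetical elemental records: (origin, destination, interaction type).
trips = [
    ("Region 1", "Region 2", "Type 1"),
    ("Region 1", "Region 2", "Type 2"),
    ("Region 3", "Region 1", "Type 1"),
    ("Region 1", "Region 2", "Type 1"),
]
regions = ["Region 1", "Region 2", "Region 3"]
kinds = ["Type 1", "Type 2", "Type 3"]

od = basic_od_matrix(trips, regions)
dyads = dyadic_od_matrix(trips, kinds)
```

In the dyadic form each origin-destination pair becomes one record and each interaction type one variable, which is the layout a SOM can take as input.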
Spatial Interaction Data (Cont.)
 Exploring the Patterns of Interaction

Very necessary!!!

Existing Exploratory Data Analysis (EDA):
lack of interactivity

Challenges:

a large number of interactions

wide range of interaction magnitudes

multiple semantics
Spatial Interaction Data (Cont.)
 Multidimensionality!!!
[Figure: stack of O-D matrices. Axes: Origin, Destination, Interaction semantics]
Spatial Interaction Data (Cont.)
Electronic products
Vehicle and parts
Machinery
Photographic products
Methodology

Self-Organizing Maps (SOM)

Visual Data Mining (VDM):

SOM as core DM engine

Interactivity
Self-Organizing Maps

A crucial task of KDD: reduce data complexity
1) Data Quantization: reduces the number of records (here, the number of spatial interactions)
2) Data Projection: reduces the number of variables (here, the number of interaction semantics)

By reducing data complexity, identification of meaningful geographic structures becomes possible

Traditional multivariate statistical methods have their limitations
Self-Organizing Maps (Cont.)
1. A special type of competitive neural network;
2. Based on some measure of dissimilarity in the attribute space;
3. Capable of reducing data complexity on two dimensions simultaneously;
4. Actually an unsupervised pattern classifier.
[Figure: SOM architecture. Labels: Input Layer, Competitive Output Layer, Winning Node, Losing Node, Output]
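
The competition among output nodes can be sketched as a nearest-prototype search. This is a minimal sketch, assuming Euclidean distance as the dissimilarity measure; the codebook values and the function name `best_matching_unit` are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def best_matching_unit(codebook, x):
    """Return the index of the node whose weight vector is closest to x."""
    return min(range(len(codebook)), key=lambda k: euclidean(codebook[k], x))


# Hypothetical codebook: one weight vector per output node.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
winner = best_matching_unit(codebook, [0.9, 0.2])
```

Every other node is a losing node for this input; only the winner and its grid neighbors are adjusted during training.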
Self-Organizing Maps (Cont.)
1. The best matching unit (BMU) changes its value to fit the input data;
2. Its neighboring nodes change their values to fit the input data as well, but the magnitude of the change decreases with distance;
3. Like a flexible net;
4. Similar data will be located close to each other in the mapping

$m_k(t+1) = m_k(t) + \alpha(t)\, h_{ck}(t)\, (x - m_k(t))$
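
One training step of the update rule m_k(t+1) = m_k(t) + alpha(t) h_ck(t) (x - m_k(t)) can be sketched as below. The Gaussian form of the neighborhood function h_ck, the parameter values, and the name `som_update` are assumptions made for illustration; the slides only state that the adjustment shrinks with distance from the BMU.

```python
import math


def som_update(codebook, grid, x, c, alpha, sigma):
    """Apply m_k <- m_k + alpha * h_ck * (x - m_k) to every node k.

    codebook: weight vectors, grid: node coordinates on the map,
    c: index of the best matching unit for input x.
    """
    updated = []
    for m_k, pos_k in zip(codebook, grid):
        # Squared grid distance between node k and the BMU c.
        d2 = sum((p - q) ** 2 for p, q in zip(pos_k, grid[c]))
        # Assumed Gaussian neighborhood: 1 at the BMU, decaying with distance.
        h_ck = math.exp(-d2 / (2.0 * sigma ** 2))
        updated.append([m + alpha * h_ck * (xi - m) for m, xi in zip(m_k, x)])
    return updated


# Hypothetical 3-node, one-dimensional map with 2-D weight vectors.
grid = [(0, 0), (1, 0), (2, 0)]
codebook = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
x = [1.0, 0.0]
new_codebook = som_update(codebook, grid, x, c=1, alpha=0.5, sigma=1.0)
```

The BMU moves halfway toward the input, while its two neighbors move by a smaller fraction, which is the "flexible net" behavior.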
Visual Data Mining

Framework
[Figure labels: Dynamic linking, Assignment, Operation, Focusing, Brushing, Colormap manipulation]
Visualization Forms
Interaction Forms
Visualization Forms
Case Studies


Airline Origin and Destination Survey Market Table (DB1BMarket): http://www.bts.org

10% of air flight itineraries

Geographic scale: airport level -> 280 metros in the contiguous US

Temporal range: 1993 to 2002
Two case studies on DB1BMarket

Cross-sectional analysis

Temporal changes
Clustering Analysis
1. A cluster is an area of low values (distance) surrounded by areas of high values (distance).
2. There are several clusters in the feature map
[Figure: feature map with clusters labeled 1 to 8 and 9-1 to 9-5]
[Figure: feature map with clusters labeled 1 to 9]
Clustering Analysis (Cont.)
A cluster is a valley in a
3-D map
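
The distance surface in which clusters appear as valleys can be sketched as the average distance between each node and its grid neighbors, a display commonly called a U-matrix. This is an illustrative sketch only: the rectangular 4-neighbor grid, the toy weights, and the name `u_matrix` are assumptions, not the formulation used in the dissertation.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def u_matrix(weights):
    """weights[r][c] is the weight vector of the node at row r, column c."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(distance(weights[r][c], weights[rr][cc]))
            # Low values mean the node resembles its neighbors (valley floor).
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u


# Hypothetical 2 x 3 map: the left columns are similar, the right column differs.
weights = [
    [[0.0], [0.1], [1.0]],
    [[0.0], [0.1], [1.0]],
]
u = u_matrix(weights)
```

Plotting `u` as a height field gives the 3-D view: nodes inside a cluster sit low, and the boundaries between clusters rise as ridges.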
Cluster Analysis (Cont.)
[Figure panels: Market Share, Contribution]
Cluster Analysis (Cont.)
[Figure labels: AA, MQ, NW, XJ, QX, HP, US, DL, EV, UA, CO, RU, WN, ZW, Multiple]

C#     Cluster Property (Airline)
1      America West (HP)
2      US Air (US)
3      Continental (CO), Continental Express (RU)
4      Northwest (NW), Mesaba (XJ)
5      Horizon (QX)
6      United (UA)
7      Air Wisconsin (ZW)
8      American (AA), American Eagle (MQ)
9-1    No dominant airlines
9-2    Southwest (WN)
9-3    Comair (OH)
9-4    Delta (DL)
9-5    Delta (DL), Atlantic Southeast (EV)
Cluster Analysis (Cont.)
[Map: Markets with US Airways Market Share >= 50%]
[Map: Markets Represented by Cluster 2]
Cluster Analysis: Markets From Nashville
[Figure labels: AA, CO, RU, WN, NW, DL, UA, US, EV]
Cluster Analysis: Markets From Nashville (Cont.)
[Figure labels: AA, CO, RU, WN, NW, DL, UA, US, EV]
Association Analysis
[Figure panels: Market Share, Average Airfare]
Association Analysis (Cont.)
American
Delta
Association Analysis (Cont.)
Average Airfare, Delta (without competition from Airtran)
Average Airfare, Delta (with competition from Airtran)
Temporal Changes
Temporal Changes (Cont.)
[Figure panels: TWA 2001, AA 1993, AA 2001, AA 2002]
Temporal Changes (Cont.)
Continental share
Northwest share
Temporal Changes: Trajectory

Market from Buffalo to DC
[Figure: trajectory through 1993, 1996, 1998, 2000, and 2001 on three panels: US Airways share, Southwest share, US Airways fare]
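
A trajectory of this kind can be sketched by mapping one market's yearly data vectors to their best matching units and reading the resulting map positions in time order. Everything in the sketch is hypothetical: the codebook, the grid, the yearly vectors (two made-up carrier shares), and the names `bmu_position` and `trajectory`.

```python
def bmu_position(codebook, grid, x):
    """Grid coordinates of the node whose weight vector is closest to x."""
    best = min(
        range(len(codebook)),
        key=lambda k: sum((m - xi) ** 2 for m, xi in zip(codebook[k], x)),
    )
    return grid[best]


def trajectory(codebook, grid, yearly_vectors):
    """List of (year, map position) pairs in chronological order."""
    return [
        (year, bmu_position(codebook, grid, vector))
        for year, vector in sorted(yearly_vectors.items())
    ]


# Hypothetical trained map and one market observed in three years.
grid = [(0, 0), (1, 0), (2, 0)]
codebook = [[0.9, 0.0], [0.5, 0.4], [0.2, 0.7]]
market = {2001: [0.25, 0.65], 1993: [0.85, 0.05], 1998: [0.55, 0.35]}
path = trajectory(codebook, grid, market)
```

Drawing `path` over a component plane, such as one carrier's share, shows how the market moves across the map as the competitive situation changes.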
Conclusions

Data rich environment: large databases, and
high dimensionality

Data complexity reduction is crucial

Results suggest that SOM:

summarizes the overall data distribution well

is capable of detecting clustered structures

can be used to analyze the properties of clustered structures

can be used to study the associations among input variables
Conclusions (Cont.)


Interactive visual data mining can:

examine subset data more closely

study relationships among interaction types

analyze how detected clusters are distributed in the actual geographic space
Help us gain a better understanding of the factors and spatial processes behind
Future Research


SOM/VDM analysis

DB1BMarket

Other types of spatial interaction data

Data at elemental level
Improved VDM environment

Human subject testing

Seamlessly coupled
Thank You!
Questions? Comments?
Contact: junyan@buffalo.edu
Background (Cont.)

Geographic databases fit the profile:

massive volume: GIS, GPS, Remote Sensing …

high dimensionality

Geographic data mining (GDM) and
geographic knowledge discovery (GKD)?

Current topic in GIS research
Background (Cont.)
Roles of Visualization
[Figure: time axis running from data driven to model driven. Stages: exploratory analysis; knowledge construction; analysis and modeling; evaluation of results. Visualization roles: visual exploration & visual data mining; visual knowledge construction & refinement; visual model tracking, model steering; data presentation, visualization of uncertainty]
Modeling Flows

Spatial interaction models: “Gravity
Models”
 Other geographic factors:

Geographic relationships among origins?

Geographic relationships among
destinations?

Association among types of interaction?
Modeling Flows

Spatial interaction models: “Gravity Models”

Push: origin

Pull: destination

Transportation cost: distance decay
$I_{ij} = k\, P_i P_j / d_{ij}^{a} = k\, P_i P_j\, d_{ij}^{-a}$
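
The gravity formula I_ij = k P_i P_j d_ij^(-a) can be checked with a small numeric sketch. The masses, distances, and the default values k = 1 and a = 2 are hypothetical, chosen only to show the distance-decay effect.

```python
def gravity_flow(p_i, p_j, d_ij, k=1.0, a=2.0):
    """Predicted interaction I_ij = k * P_i * P_j * d_ij ** (-a)."""
    return k * p_i * p_j * d_ij ** (-a)


# Same origin (push) and destination (pull) masses at two distances.
flow_near = gravity_flow(1000, 500, 10.0)
flow_far = gravity_flow(1000, 500, 20.0)
```

With a = 2, doubling the distance cuts the predicted flow to a quarter, which is the distance-decay term at work.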
Spatial Interaction Data (Cont.)
Spatial Interaction Data (Cont.)
Limitations of Traditional Multivariate Methods
Data Projection
Factor analysis
Projection pursuit
Multi-dimensional scaling

Data Quantization
Partitioning methods
Hierarchical methods

o Linearity
o Stationarity
o Normal distribution
o Limited data amount
o One-dimension compression

o Non-linear
o Non-stationary
o Distribution unknown
o Sparse
o Large data amount
o Multi-dimensional
Visualization Forms
Interaction Forms
Interaction Forms
Data Distribution
1. Similar data distributions
2. But greatly reduced number of low values
3. SOM prototype represents original data well
Cluster Analysis (Cont.)
[Map: Markets with Southwest Market Share >= 50%]
[Map: Markets with Southwest Market Share >= 20%]
[Map: Markets Represented by Cluster 9-2]
Temporal Changes (Cont.)
US Airways share
American share
Temporal Changes (Cont.)
United share
Delta share
Temporal Changes (Cont.)
Temporal Trend: Trajectory (Cont.)

Market from Buffalo to NYC
[Figure: trajectory through 1993, 1996, 2000, and 2001 on three panels: US Airways share, JetBlue share, US Airways fare]
Temporal Trend: Trajectory (Cont.)

Market from Buffalo to Atlanta
[Figure: trajectory through 1993 and 1998 on three panels: Delta share, Airtran Airways share, Delta fare]