Ph.D. Dissertation Defense

Geographic Knowledge Discovery in Spatial Interaction With Self-Organizing Maps

Jun Yan
Geography Department
SUNY at Buffalo
July 29, 2004

Dissertation Committee:
  Dr. Jean-Claude Thill (Chair)
  Dr. Ling Bian
  Dr. David Mark

Outline
  Background
  Spatial Interaction Data
  Methodology
    Self-Organizing Maps
    Visual Data Mining
  Case studies
  Conclusions and Future Research

Background
  Data-rich vs computation-rich: challenge? opportunity !!!
  Information technologies: Two Legs!!!
    More data available
    More tools available

Background (Cont.)
  Data Mining & Knowledge Discovery: “useful information from large databases”
    useful
    novel
    valid
    understandable
  Geographic data mining (GDM) and geographic knowledge discovery (GKD)?

Background (Cont.)
  Mining techniques: statistics, pattern recognition, machine learning, visualization, high performance computing …
  Knowledge discovery process
  [Figure: knowledge discovery process. Labels: User, DBMS, Domain Knowledge, Controller, DB Interface, Target Data Selection, Data Mining, Knowledge Base, Evaluation, Discoveries]

Background (Cont.)
  Finding all the patterns in a database autonomously? Unrealistic, because the patterns could be too many but uninteresting
  Data mining: an iterative, interactive, semiautomated process
    people direct what is to be mined
  Visualization: Geovisualization (GVis)
    visual data mining !!!

Visualization in KDD Process
  KDD steps:
    Selecting Application Domain
    Selecting Target Data
    Processing Data
    Extracting Information/Knowledge
    Interpretation and Evaluation
  Roles of visualization:
    Understanding basic data distribution, selecting meaningful target datasets
    Locating missing data, noise removal, data smoothing
    Parameter setting, process tracking, process steering
    Interpretation, reporting, comparison, validity checking

Background (Cont.)
  Machine learning & Neural Networks
  [Figure: learning system. Inputs: Examples, Background knowledge (sometimes). Learning Algorithm. Outputs: Concept description or Other knowledge]
  [Figure: neural network. Labels: Input layer, Hidden layer, Output layer]

Background (Cont.)
  Objectives:
    Explore the effectiveness of neural networks in GKD
    Examine the roles of GVis in GKD

Spatial Interaction Data
  What is spatial interaction?
    Pairs of places
    Elemental: trips made by individuals
    Aggregate: flows from origins to destinations
  Examples: migration, freight shipment, movement of capital & information …

Spatial Interaction Data (Cont.)
  Elemental level: trip table

              Origin    Destination    Distance
    Trip 1
    Trip 2
    Trip 3

  Aggregate level: basic O-D matrix

                Region 1    Region 2    Region 3
    Region 1
    Region 2
    Region 3

  Aggregate level: dyadic O-D matrix

                          Type 1    Type 2    Type 3
    Region 1>Region 1
    Region 1>Region 2
    Region 1>Region 3

Spatial Interaction Data (Cont.)
  Exploring the Patterns of Interaction: Very necessary!!!
  Existing Exploratory Data Analysis (EDA): lack of interactivity
  Challenges:
    a large number of interactions
    a wide range of interaction magnitudes
    multiple semantics

Spatial Interaction Data (Cont.)
  Multidimensionality!!!
  [Figure: O-D Matrices. Axes: Origin, Destination, Interaction semantics]

Spatial Interaction Data (Cont.)
  [Figure panels: Electronic products; Vehicle and parts; Machinery; Photographic products]

Methodology
  Self-Organizing Maps (SOM)
  Visual Data Mining (VDM):
    SOM as core DM engine
    Interactivity

Self-Organizing Maps
  A crucial task of KDD: reduce data complexity
    1) Data Quantization: number of records, here number of spatial interactions
    2) Data Projection: number of variables, here number of interaction semantics
  By reducing data complexity, identification of meaningful geographic structures becomes possible
  Traditional multivariate statistical methods have their limitations

Self-Organizing Maps (Cont.)
  1. A special type of competitive neural network;
  2. Based on some measure of dissimilarity in the attribute space;
  3. Capable of reducing data complexity along both dimensions (quantization and projection) simultaneously;
  4. Actually an unsupervised pattern classifier.
  [Figure: competitive network. Labels: Input Layer, Competitive Output layer, Winning Node, Losing Node, Output]

Self-Organizing Maps (Cont.)
  1. The best matching unit (BMU) changes its value to fit the input data;
  2. Its neighboring nodes change their values to fit the input data as well, only the magnitude of the change decreases with distance;
  3. Like a flexible net;
  4. Similar data will be located close to each other in the mapping.

  m_k(t+1) = m_k(t) + \alpha(t) h_{ck}(t) (x - m_k(t))

Visual Data Mining Framework
  [Figure: framework diagram. Labels: Visualization Forms; Interaction Forms; Dynamic linking; Assignment Operation; Focusing; Brushing; Colormap manipulation]

Case Studies
  Airline Origin and Destination Survey Market Table (DB1BMarket): http://www.bts.org
    10% sample of air flight itineraries
    Geographic scale: airport level
    280 metros in Contiguous US
    Temporal range: 1993 to 2002
  Two case studies on DB1BMarket
    Cross-sectional analysis
    Temporal changes

Clustering Analysis
  1. A cluster is an area of low values (distance) surrounded by areas of high values (distance).
  2. There are several clusters in the feature map.
  [Figure: feature maps with clusters labeled 1 to 9; cluster 9 is subdivided into 9-1 to 9-5]

Clustering Analysis (Cont.)
  A cluster is a valley in a 3-D map

Cluster Analysis (Cont.)
  [Figure panels: Market Share; Contribution]

Cluster Analysis (Cont.)

    C#     Cluster Property (Airline)
    1      America West (HP)
    2      US Air (US)
    3      Continental (CO), Continental Express (RU)
    4      Northwest (NW), Mesaba (XJ)
    5      Horizon (QX)
    6      United (UA)
    7      Air Wisconsin (ZW)
    8      American (AA), American Eagle (MQ)
    9-1    No dominant airlines
    9-2    Southwest (WN)
    9-3    Comair (OH)
    9-4    Delta (DL)
    9-5    Delta (DL), Atlantic Southeast (EV)

  [Figure labels: AA, MQ, NW, XJ, QX, HP, US, DL, EV, UA, CO, RU, WN, ZW, Multiple]

Cluster Analysis (Cont.)
  [Figure panels: Markets with US Airways Market Share >= 50%; Markets Represented by Cluster 2; Cluster 2]

Cluster Analysis: Markets From Nashville
  [Figure panels: AA, CO, RU, WN, NW, DL, UA, US, EV]

Cluster Analysis: Markets From Nashville (Cont.)
  [Figure panels: AA, CO, RU, WN, NW, DL, UA, US, EV]

Association Analysis
  [Figure panels: Market Share; Average Airfare]

Association Analysis (Cont.)
  [Figure panels: American; Delta]

Association Analysis (Cont.)
  [Figure panels: Average Airfare, Delta (without competition of Airtran); Average Airfare, Delta (with competition of Airtran)]

Temporal Changes

Temporal Changes (Cont.)
  [Figure panels: TWA 2001; AA 1993; AA 2001; AA 2002]

Temporal Changes (Cont.)
  [Figure panels: Continental share; Northwest share]

Temporal Changes: Trajectory
  Market from Buffalo to DC
  [Figure panels: US Airways share; Southwest share; US Airways fare. Trajectory points labeled 93, 96, 98, 00, 01]

Conclusions
  Data rich environment: large databases and high dimensionality
  Data complexity reduction is crucial
  Results suggest that SOM:
    summarizes the overall data distribution well
    is capable of detecting clustered structures
    can be used to analyze the properties of clustered structures
    can be used to study the associations among input variables

Conclusions (Cont.)
  Interactive visual data mining can:
    examine subset data more closely
    study relationships among interaction types
    analyze how detected clusters are distributed in the actual geographic space
    help us gain a better understanding of the factors and spatial processes behind the interactions

Future Research
  SOM/VDM analysis
    DB1BMarket
    Other types of spatial interaction data
    Data at elemental level
  Improved VDM environment
    Human subject testing
    Seamlessly coupled

Thank You!
  Questions? Comments?
  Contact: junyan@buffalo.edu

Background (Cont.)
  Geographic databases fit the profile:
    massive volume: GIS, GPS, Remote Sensing …
    high dimensionality
  Geographic data mining (GDM) and geographic knowledge discovery (GKD)?
    Current topic in GIS research

Background (Cont.)
  Roles of Visualization
  [Figure: stages of analysis along a Time axis, from Data driven to Model driven]
    Stages:
      Exploratory analysis
      Knowledge construction
      Analysis and modeling
      Evaluation of results
    Roles of visualization:
      Visual exploration & visual data mining
      Visual knowledge construction & refinement
      Data presentation, visualization of uncertainty
      Visual model tracking, model steering

Modeling Flows
  Spatial interaction models: “Gravity Models”
  Other geographic factors:
    Geographic relationships among origins?
    Geographic relationships among destinations?
    Association among types of interaction?

Modeling Flows
  Spatial interaction models: “Gravity Models”
    Push: origin
    Pull: destination
    Transportation cost: distance decay

  I_{ij} = k P_i P_j / d_{ij}^{a} = k P_i P_j d_{ij}^{-a}

Spatial Interaction Data (Cont.)

Limitations of Traditional Multivariate Methods
  Data Projection
    Factor analysis
    Projection pursuit
    Multi-dimensional scaling
  Data Quantization
    Partitioning methods
    Hierarchical methods

    o Linearity
    o Stationary
    o Normal distribution
    o Limited data amount
    o One dimension compression

    o Non-linear
    o Non-stationary
    o Distribution unknown
    o Sparse
    o Large data amount
    o Multi-dimensional

Visualization Forms

Interaction Forms

Data Distribution
  1. Similar data distributions
  2. But greatly reduced number of low values
  3. SOM prototypes represent the original data well

Cluster Analysis (Cont.)
  [Figure panels: Markets with Southwest Market Share >= 50%; Markets with Southwest Market Share >= 20%; Markets Represented by Cluster 9-2; Cluster 9-2]

Temporal Changes (Cont.)
  [Figure panels: US Airways share; American share]

Temporal Changes (Cont.)
  [Figure panels: United share; Delta share]

Temporal Changes (Cont.)

Temporal Trend: Trajectory (Cont.)
  Market from Buffalo to NYC
  [Figure panels: US Airways share; JetBlue share; US Airways fare. Trajectory points labeled 93, 96, 00, 01]

Temporal Trend: Trajectory (Cont.)
  Market from Buffalo to Atlanta
  [Figure panels: Delta share; Airtran Airways share; Delta fare. Trajectory points labeled 93, 98]
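
Supplement: SOM update rule, sketched in code
  The Self-Organizing Maps slides describe the best matching unit (BMU) and its neighbors moving toward each input, following m_k(t+1) = m_k(t) + alpha(t) h_ck(t) (x - m_k(t)). The Python sketch below is one possible reading of that rule, not the implementation used in the dissertation. The Gaussian neighborhood, the linear decay of alpha and sigma, the function names, and the toy market records are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate of the node whose weight vector is closest to x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(weights, key=lambda node: sq_dist(weights[node]))


def som_step(weights, x, alpha, sigma):
    """Apply m_k(t+1) = m_k(t) + alpha * h_ck * (x - m_k(t)) to every node, in place."""
    c = best_matching_unit(weights, x)
    for k, m_k in weights.items():
        # Assumed Gaussian neighborhood: h_ck shrinks with grid distance from the BMU c.
        grid_d2 = (k[0] - c[0]) ** 2 + (k[1] - c[1]) ** 2
        h_ck = math.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights[k] = [m + alpha * h_ck * (xi - m) for m, xi in zip(m_k, x)]
    return c


def train(data, rows, cols, epochs=20, alpha0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols map; alpha and sigma decay linearly (an illustrative schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:
            frac = 1.0 - t / total
            som_step(weights, x, alpha0 * frac, max(sigma0 * frac, 0.1))
            t += 1
    return weights


# Hypothetical toy input: each record is a market described by two carrier shares.
markets = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
som = train(markets, rows=3, cols=3)
bmus = [best_matching_unit(som, x) for x in markets]
```

  Here bmus holds the map node assigned to each market. Records with similar share profiles would be expected to land on the same or nearby nodes, which is the property the clustering slides rely on.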