1 - Spatial Database Group

CS8715 FINAL EXAM STUDY NOTES
ANSWERS TO THE TEXTBOOK QUESTIONS
Chapter 1
1.1 Discuss the differences between spatial and nonspatial data.
Spatial data is data that has an element related to space or location. Examples of spatial data include zip codes
and shapes (such as molecule shapes). Nonspatial data is any data without a spatial attribute, such as names
and most measured values (blood pressure, finish times in a marathon).
1.2 Geographic applications are a common source of spatial data. List at least four other important
sources of spatial data.
Satellite images
Medical images
Phone book (address)
Environmental agency
Transportation agency
Cell phone company
Internet
1.3 What are the advantages of storing spatial data in a DBMS as opposed to a file system?
File systems focus on minimizing the computational time of an algorithm, under the assumption that all
necessary data fit within an infinite supply of main memory. Unfortunately, this is not the case, as most
datasets cannot be stored completely in this manner; rather, the data resides on a hard disk, and retrieving
it takes far more time than a memory access. DBMS storage has therefore focused on optimizing I/O time.
Storing spatial data in a DBMS results in faster data retrieval because a DBMS uses indices to optimize
data queries. While classical indexing sacrifices spatial proximity, the development of spatial DBMSs is
helping to alleviate this problem: extending the classic B-tree data structure to the R-tree allows the
handling of multidimensional extended objects, of which spatial objects are a key example. Overall, while
GIS developers continually improve computational algorithms, the larger gains in data access lie in
reducing the cost of reading data from the hard disk rather than from memory.
1.4 Database engines in general have underlying (a) data models and (b) query engines.
(a) The data model of SDBs can be implemented as an extension to the relational data model via the
object-relational extensions of traditional relational databases. Object-relational databases (ORDBs)
offer two new features over traditional relational databases: (1) new extended types and (2) new
operations on these types. SDBs, at least in the OGIS model, can be viewed as ORDBs with (1) six
spatial types (point, line string, polygon, multipoint, multiline string, and multipolygon) and (2) a
dozen new operations, which are a subset of the operations defined by the 9-intersection model.
(b) ORDB query engines offer hooks for new indices and query-optimization hints. SDBs
implemented on ORDB type extensions can use these features for spatial indices and query
optimization.
1.5 List the differences and similarities between spatial, CAD, and image databases?
Similarities:
1. All use an object-relational database as the bottom layer.
2. All use a spatial application as the top layer, which interacts with the OR-DBMS through the spatial database.
3. All use abstract data types.
Differences:
1. Reference frames
2. Data models
3. Domain-specific rules for query optimization
1.6 The interplay between the vector and raster data models has often been compared with the wave-particle duality in physics. Discuss.
The object model can be likened to particles: in it, spatial entities are considered unique and separate
from each other. The vector structure makes regions appear discrete, just as particles can be seen as parts of
a whole. The field model is defined by functions that map a spatial area to a given domain; thus, the raster
structure provides a way to represent continuous behavior, as is seen in waves.
1.7 Cognitive maps are described as “internal representations of the world and its spatial properties
stored in memory.” Do humans represent spatial objects as discrete entities in their minds? Is there a
built-in bias against raster and in favor of vector representation?
Humans generally think of spatial entities in a whole form. For example, when people think of a lake, they
think it is a whole water surface instead of multiple “water points” that form the lake. However, in some
special cases, people may think that a spatial entity consists of many points. For example, astronomers may
represent a galaxy as a polygon in the sky map. But people may think that the galaxy is composed of many
stars.
All in all, there is a built-in bias against raster representation and in favor of vector representation in the
human mind. In reality, either representation is just a model, and therefore a simplification, of the real world.
Neither can express everything about a real-world entity. Thus, in database applications and other
applications involving spatial entities, both are useful; which representation is used depends on the
specific application.
1.8. Why is it difficult to sort spatial data?
The essential problem is that spatial data has no natural sort order. As a consequence, many different sort
orders for the same data set will then be available, and agreement on which one to use will be difficult.
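A tiny Java illustration of this point (the Pt record and the sample coordinates are invented for the demo): the same four points come out in two different, equally defensible orders, so neither ordering is "the" natural one.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortOrderDemo {
    record Pt(int x, int y) {}

    public static void main(String[] args) {
        List<Pt> pts = new ArrayList<>(List.of(
                new Pt(0, 3), new Pt(1, 0), new Pt(2, 2), new Pt(3, 1)));

        pts.sort(Comparator.comparingInt(Pt::x)); // one defensible order
        System.out.println("by x: " + pts);       // (0,3) (1,0) (2,2) (3,1)

        pts.sort(Comparator.comparingInt(Pt::y)); // an equally defensible order
        System.out.println("by y: " + pts);       // (1,0) (3,1) (2,2) (0,3)
    }
}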
1.9. Predicting global climate change has been identified as a premier challenge for the scientific
community. How can SDBMS contribute in the endeavor?
Calibrating today's complex climate change models requires an enormous amount of spatial and temporal
data. As a result, global climate modelers benefit from SDBMS in a variety of ways. First, spatially
enabled databases provide a persistent and secure tool to input, store, and retrieve the spatial and temporal
data. Because of the large volume of data being collected and maintained, it is economically efficient to
have multiple clients (e.g., researchers at various institutions) access a small number of well-maintained
data repositories, and SDBMS are designed for this type of multi-client access. Further, the majority of
commercial and open-source SDBMS use a standard query language (i.e., SQL), which allows clients to
access spatial database servers relatively independently of software or platform constraints. With the advent
of SQL3 (and other parallel query languages), clients can perform complex distance and topological queries
in addition to standard nonspatial queries. Lastly, through advances in spatial object indexing, these
queries can be executed with great efficiency.
In addition to data management, access, and canned attribute and spatial query capabilities, SDBMS
provide researchers with a body of data that can be exploited through data-mining techniques to reveal
relationships between exogenous variables and climate change. Originating in the marketing and
management fields, data-mining techniques offer enormous potential for modeling climate change and
other environmental processes.
1.10. E-commerce retailers reach customers all over the world via the Internet from a single facility.
Some of them claim that geography and location are irrelevant in the Internet age. Do you agree?
Justify your answer.
We do not agree. Several factors remain tied to geography and location:
- Shipping
Suppose a Korean customer wants to order an item from an e-commerce retailer whose warehouse is located
in the U.S. If the retailer's website does not support international shipping, both sides will have difficulty. Thus
the e-commerce retailer must consider its shipping boundary.
- Tax
If the retailer's warehouse is located in Minnesota and a customer living in Minnesota orders an
item from the site, the item will be subject to sales tax, whereas no tax is imposed on many cross-state
transactions. Thus the e-commerce retailer must consider tax according to the customer's location.
- Location-relevant items
If the e-commerce retailer deals in real-estate properties such as houses, the items on its website must
carry location information. A travel package is another example that must be considered together with
location information. As these examples show, the item itself sometimes has a geographic property.
1.11. Define location-based and location-independent services. Provide examples.
What is a location-based service?
A location-based service is an information service provided by a device that knows where it is and
modifies the information it provides accordingly. For example, any device that can measure its location
(e.g., via GPS), or whose location can be inferred, and that can use that knowledge to select, transform,
or modify information, can provide a location-based service.
Example of a location-based service:
For example, the proposed E-911 emergency system requires cell phone operators to provide the
government with the location of a 911 caller. In this case, the cell phone needs to be equipped with a device
(such as GPS) that can provide the caller’s location.
What is a location-independent service?
A location-independent service is an information service that does not need to know, and does not need
to provide information about, location.
Example of a location-independent service:
A website (server) that provides music for users to download; users can pay for the service using credit
cards regardless of where they are.
Chapter 2
2.2. Weather forecasting is an example where the variables of interest – pressure, temperature, and
wind – are modeled as fields. But the public prefers to receive information in terms of discrete
entities, for example, "The front will stall," or "This high will weaken." Can you cite another
example where this field-object dichotomy is apparent?
Inside a car, the engine and other components are monitored as continuous readings, and if a
temperature goes out of a given range, the discrete 'Check Engine' light comes on. The driver does not
need to know the status of each component at all times, just whether the car is running properly.
2.3. A lake is sometimes modeled as an object. Can you give an example in which it might be useful to
model a lake as a field?
If we want to model the depth in different parts of the lake, it would be more appropriate to model the lake
as a field, because depth varies continuously and there are no clear internal boundaries.
Are lake boundaries well defined?
The boundaries of a lake are often not well defined. For instance, it is unclear whether the extension into a
river or a creek should be considered part of the lake, and during flood season the lake's boundary
disappears as the inflow/volume proportion increases.
2.4: Match the columns:
Nominal – Social security # (a number used purely as a label)
Ordinal – Color spectrum (ordered, but differences are not meaningful)
Interval – Temp. in Celsius (differences are meaningful, but the zero point is arbitrary)
Ratio – Temp. in Kelvin (true zero, so ratios are meaningful)
2.5 Design an ER diagram to represent the geographic and political features of the World. The
World consists of three entities: country, city, and river. On the basis of these three entities answer
the following questions:
We assumed that every country has exactly one business capital. We further assume that every river is
owned by at least one country, but there may be countries that do not own any rivers. We also assumed that
the database contains cities that are not business capitals. Moreover, in our model, we allow countries not to
have diplomatic ties with any other country. Note: the thick lines on the diagram indicate full participation;
the thin lines indicate partial participation.
2.6 Consider the problem of representing the OGIS hierarchy in an OO language such as Java. How
would you model inheritance? For example, MultiPoint inherits properties from both the Point and
GeometryCollection classes. How will you model associations and cardinality constraints? Where
will you use abstract classes?
Since Java does not support multiple inheritance of classes, types such as MultiPoint, which inherit from
both Point and GeometryCollection, are modeled with interfaces (or with one superclass plus interfaces).
Common functional units can be placed in abstract classes. Single inheritance is modeled by defining a
class as a subclass of a previously defined class. Associations are modeled as references between classes,
and cardinality constraints can be enforced in the association code (e.g., by checking a collection's size).
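A minimal Java sketch of one way to arrange this (the type names follow the OGIS hierarchy, but the members and the envelope() method are illustrative assumptions):

import java.util.ArrayList;
import java.util.List;

// Root of the hierarchy: an abstract class holds shared behavior.
abstract class Geometry {
    abstract double[] envelope(); // common functional unit: bounding box
}

// A concrete basic shape.
class Point extends Geometry {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
    @Override double[] envelope() { return new double[]{x, y, x, y}; }
}

// The GeometryCollection side of MultiPoint's dual inheritance is expressed
// as an interface, since a Java class may extend only one superclass.
interface GeometryCollection<T extends Geometry> {
    List<T> elements();
}

// MultiPoint extends Geometry and implements the collection interface.
class MultiPoint extends Geometry implements GeometryCollection<Point> {
    private final List<Point> points = new ArrayList<>();

    // Association with a cardinality constraint (>= 1) enforced in code.
    MultiPoint(List<Point> pts) {
        if (pts.isEmpty()) throw new IllegalArgumentException("needs >= 1 point");
        points.addAll(pts);
    }

    @Override public List<Point> elements() { return List.copyOf(points); }

    @Override double[] envelope() {
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (Point p : points) {
            minX = Math.min(minX, p.x); minY = Math.min(minY, p.y);
            maxX = Math.max(maxX, p.x); maxY = Math.max(maxY, p.y);
        }
        return new double[]{minX, minY, maxX, maxY};
    }
}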
2.7 Which UML diagrams are relevant for data modeling? Do they offer any advantage over ER?
How will you represent the following concepts in UML: primary key, foreign key, entities,
relationships, cardinality constraints, participation constraints, and pictograms?
UML helps provide information relevant to both the field and object data models. For starters, the field
model is seen in the class diagrams and methods, because specific objects are seen as belonging to a group
(with corresponding functionality). Underneath this, objects are assigned unique identification by the
system, but this is not shown in the UML diagram. Instead, the object model is roughly represented by class
attributes, although there is no guarantee that the attributes uniquely identify objects (e.g., two objects may
have the same name).
While UML does not provide a way to quickly tell what makes an object unique (as the ER diagram does
by showing primary-key attributes), this is unneeded, since the objects that UML groups together have
already been identified, and the schema designer does not really need to know how. UML then provides a
distinct advantage over ER by showing how classes are related (e.g., through inheritance) and how a class is
defined, which together make it easy to transfer a schema to object definitions.
How to represent in UML?
Primary key – Is not done (unnecessary, considering all objects are uniquely identified by the same
attribute: oid).
Foreign key – Since there is no primary key modeled, there is nothing to reference in another class.
However, UML does model associations between entities.
Entities – These become classes, and additional information (such as methods) is included.
Relationships – Remain the same as in the ER model, except that M:N relationships do not necessarily
require a new class (unless the relationship adds attributes).
Cardinality – Done the same way in UML.
Participation – This is done with aggregation (to demonstrate the part-of-a-whole nature of some
classes/entities). The total participation needed for weak entities is not modeled.
Pictograms – These are added to class diagrams just as they were for entities.
2.8 Model the state-park example using UML.
2.9. Classify the following into local, focal, zonal:
a. slope: focal
b. snow-covered park: local (Consider a point. Is it snow-covered?)
c. site selection: focal (Is the neighborhood around point X suitable?)
d. average population: zonal
e. total population: zonal
2.10. Many spatial data types in OGIS refer to a planar world. These include line string and
polygons. These may introduce large approximation errors when modeling the shape of long rivers or
larger counties on the spherical surface of the earth. Propose a few spherical spatial data types to better
approximate large objects on a sphere. Which spatial relationships (e.g. topological, metric) are
affected by planar approximations?
The metrics of large geographic features become distorted when projected onto a planar surface. In an
attempt to minimize this distortion, cartographers have developed a variety of projection systems (e.g.,
UTM, State Plane, Albers). However, no projection scheme completely removes distortion of very
large spatial objects.
Rather we might consider expanding the set of basic shapes (and their collections) to include additional
higher dimension shapes. One such spatial data type might represent volume (and volume collections). A
spherical world is three dimensional, which has implications for non-topological spatial relationships (i.e.,
metric attributes and relationships are distorted on a sphere). Geographers and cartographers are very
familiar with the difficulties of projecting three-dimensional data into a two-dimensional model – changes
include direction, distance, and area. By modeling space in three dimensions, we can avoid the
compromises that projections demand – we can retain topology and also have correct direction, distance,
and area. Volumes would be composed of one or more polygons.
While existing topological relationships would remain valid, new relationships (and query predicates)
would necessarily come into existence. These would describe the various relationships between the volume
(and volume collections) objects and the pre-existing simple objects.
2.11. Revisit the Java program implementing the query about tourist offices within ten miles of
campground "Maple." The main() in class FacilityDemo will slow down linearly as the number of
tourist offices increases. Devise a variation of the plane-sweep algorithm from Chapter 1 to speed up
the search algorithm.
The fTable array is sorted on the X-coordinate of the locations of the facilities. The method main (which
finds the facilities within 10 miles of "Maple") looks only at entries whose X-coordinates are less than or
equal to (X-coordinate of "Maple") + 10, checking each time whether the entry lies within a 10-mile
radius. This lets the method stop scanning the array once the X-coordinate becomes greater than
(X-coordinate of "Maple") + 10. Assumption: coordinates of 'Point' and distances are expressed in miles.
public class FacilityDemo {
    public static void main(String[] args) {
        Facility f = new Facility("Maple", "Campground", new Point(2.0, 4.0));
        Facility[] fTable = new FacilitySet("facilityFile").facilities; // textbook helper; accessor assumed
        String[] resultTable = new String[fTable.length];
        int j = 0;
        Sort(fTable); // sort fTable on the X-coordinate of the location
        int i = 0;
        double xu = 2.0 + 10.0; // X-coordinate of "Maple" + 10
        while (i < fTable.length && fTable[i].Point.x <= xu) {
            if (f.withinDistance(fTable[i], 10.0)
                    && fTable[i].type.equals("Tourist-Office"))
                resultTable[j++] = fTable[i].name;
            i++;
        } // end while
    } // end main
} // end FacilityDemo
We can cut down further on the scan if the facility “Maple” is in the array.
Since the array is sorted, we can do a binary search for the first entry with X-coordinate = 2.0, find the
entry for "Maple", and scan in both directions until the X-coordinate of the facility is less than
(X-coordinate(Maple) – 10) or greater than (X-coordinate(Maple) + 10).
This restricts the scan to the range
(X-coordinate (Maple) – 10) < x < (X-coordinate (Maple) + 10)
public class FacilityDemo {
    public static void main(String[] args) {
        Facility f = new Facility("Maple", "Campground", new Point(2.0, 4.0));
        Facility[] fTable = new FacilitySet("facilityFile").facilities; // textbook helper; accessor assumed
        String[] resultTable = new String[fTable.length];
        int j = 0;
        Sort(fTable); // sort fTable on the X-coordinate of the location
        int x = BinarySearch(fTable, 2.0); // index of first entry with X-coordinate 2.0
        if (x == -1) return; // assuming the search returns -1 on failure
        // The method works only if "Maple" is in the array.
        while (x < fTable.length && fTable[x].Point.x == 2.0
                && !fTable[x].name.equals("Maple"))
            x++;
        double xu = 2.0 + 10.0; // X-coordinate of "Maple" + 10
        double xl = 2.0 - 10.0; // X-coordinate of "Maple" - 10
        int i = x;
        // Scan to the right of "Maple".
        while (i < fTable.length && fTable[i].Point.x <= xu) {
            if (f.withinDistance(fTable[i], 10.0)
                    && fTable[i].type.equals("Tourist-Office"))
                resultTable[j++] = fTable[i].name;
            i++;
        } // end while
        i = x - 1; // restart just left of "Maple" so its slot is not scanned twice
        // Scan to the left of "Maple".
        while (i >= 0 && fTable[i].Point.x >= xl) {
            if (f.withinDistance(fTable[i], 10.0)
                    && fTable[i].type.equals("Tourist-Office"))
                resultTable[j++] = fTable[i].name;
            i--;
        } // end while
    } // end main
} // end FacilityDemo
2.12 Develop translation rules to convert pictograms in a conceptual data model (e.g., ERD) to OGIS
spatial data types embedded in a logical data model (e.g., Java, SQL) or physical data model.
Translation rules to convert pictograms in a conceptual data model to OGIS spatial data types in a logical data
model:
1) Syntax-directed translation:
Non-terminating elements need to be translated into terminating elements. For example ( | denotes OR):
<pictogram> -> <shape> | <any possible shape *> | <user-defined shape !>
<shape> -> <basic shape> | <multi-shape> | <derived shape> | <alternate shape>
<basic shape> -> point | line | polygon
<multi-shape> -> <basic shape> <cardinality>
<derived shape> -> <basic shape>
<alternate shape> -> <basic shape> <derived shape> | <basic shape> <basic shape>
Cardinality quantifies the number of objects, so it indicates the presence or absence of the objects;
cardinality therefore results in integrity constraints. For example:
<cardinality> -> 0,1 | 0,n | 1 | n | 1,n
(0,1) or (0,n) denotes that we may have 0 or many objects, so NULL is allowed.
(1), (n), or (1,n) denotes that we have 1 or many objects, so the value is NOT NULL.
2) Translating entity pictograms to data types:
Entity pictograms inserted inside the entity boxes are translated into appropriate data types. These data types
are Point, Line, and Polygon. The translator, using the syntax-directed translation, does this translation
automatically. For example:
<pictogram data type> -> PointType | LineType | PolygonType
<raster partition> -> <raster> | <Thiessen> | <TIN>
3) Translating relationship pictograms:
Relationships among geographic entities are actually conditions on the objects' positions and are called spatial
relationships. Relationship pictograms are translated into spatial integrity constraints of the database.
For example:
<relationship> -> part_of (partition) | part_of (hierarchical)
4) Translation of entities and relationships to tables
(1) Map each entity onto a separate relation. The attributes of the entity are mapped onto the attributes of the
relation. The key of the entity is the primary key of the relation.
(2) For a relationship whose cardinality is 1:1, the key attribute of either one of the entities is placed as a foreign key
in the other relation.
(3) If the cardinality of the relationship is M:1, then place the primary key of the 1-side relation as a foreign key in
the relation of the M-side.
(4) If the cardinality of the relationship is M:N, then each M:N relationship is mapped onto a new relation. The
name of the relation is the name of the relationship, and the primary key of the relationship consists of the pair of
primary keys of the participating entities. If the relationship has any attributes, then they become attributes of the
new relation.
(5) For a multi-valued attribute, a new relation is created which has two columns: one corresponding to the
multi-valued attribute and the other to the key of the entity that owns the multi-valued attribute. Together, the
multi-valued attribute and the key of the entity constitute the primary key of the new relation.
2.15
OGIS data type | Pictogram
Point | point pictogram
Line string | line pictogram
Polygon | polygon pictogram
Multipoint | point pictogram with cardinality n
Multiline string | line pictogram with cardinality n
Multipolygon | polygon pictogram with cardinality n
(The pictogram symbols themselves are graphical and are not reproduced here.)
2.17
Countries on a map of the world are represented as surfaces and modeled by polygons. Rivers are
represented as curves and modeled by line strings. Lakes are represented as surfaces and modeled by
polygons. Highways are represented as curves and modeled by multiline strings. Cities are represented
by points.
As the scale of the map grows larger, the spatial data types change from lower-dimensional objects to
higher-dimensional ones; that is, entities change from points to line strings, and from line strings to
polygons. To represent scale dependence in pictograms, one could mark the map scale on each entity's
pictogram.
2.18
1. A single manager can only manage a single forest.
2. A manager must manage all the forest-stands in a forest (given that no manager can co-manage a forest).
3. Many (or one) fire-stations can monitor a single forest (better safe than sorry).
4. The forest can have many facilities within its borders ("belonging"), and every facility can only be
associated with one forest.
// NOTE: For 5 and 6, no specific spatial relationship can be defined, but by knowing the object types
involved, we can determine likely relationships.
5. River is connected to Forest by Road, so the spatial relationship is line→line→polygon. We know that
a river crosses a road and the road crosses the forest, but it may not be true that a river crosses a forest.
If the two have any relationship at all, it might be "crosses" or "borders".
6. The relationships are of the form polygon→polygon. In this case, we also know that many forest stands
are within the forest. Then the relationships are "contains" or "overlaps" (for when the boundary of a
forest cuts through a forest_stand object).
2.20 Study the relational schema in Figure 2.5 and identify any missing tables for M:N relationships.
Also identify any missing foreign keys for 1:1 and 1:N relationships in Figure 2.4. Should the spatial
tables in Figure 2.6 include additional tables for collections of points, line strings, or polygons?
M:N relationships in Figure 2.4:
Relationship | Included in Figure 2.5?
Facility – supplies_water_to – River | Yes [Supplies_Water_To]
River – crosses – Road | No
Road – accesses – Forest | Yes [Road-Access-Forest]
The ‘River – crosses – Road’ relationship can be computed from the geometries, hence there is no need to
explicitly include a table designated for this relationship.
1:1 relationships in Figure 2.4:
Manager – manages – Forest: Manager has ForName as a foreign key; Forest has no manager attribute.
Facility – belongs_to – Forest: Facility has ForestName as a foreign key; Forest has no facility name.
Strictly speaking, in a 1:1 relationship between entities A and B, it is enough to have the key of one entity
included in the table of the other; i.e., it is enough to include the forest name in the Manager table.
However, in the above example, if the manager of a forest is frequently queried, then instead of joining the
Forest and Manager tables for every query, it may be advantageous to record the manager into the Forest
table. This brings in redundancy and consequently issues of potential delete and update inconsistencies, but
such a decision can increase the performance. [Physical Schema Tuning]
1:M relationships in Figure 2.4:
Facility – within – Forest: Facility has ForestName2 as a foreign key; the reverse is not possible.
ForestStand – part_of – Forest: Forest-Stand has Forest-name as a foreign key; the reverse is not possible.
FireStation – monitors – Forest: Fire-Station has ForName as a foreign key; the reverse is not possible.
Since the relationships in question are 1:M, it is not possible to apply the physical schema design ideas
described above. If we tried to insert all facilities that a certain forest contains into the Forest table, we
would even violate the first normal form.
On the other hand, if we stored the set of facilities as a multi-point, that is as a single object, then we could
possibly create a Facility attribute in the Forest table to avoid repetitive computation of the join. In this
case, in effect, we replace a 1:M relationship with a 1:1 relationship by representing a set of points as a
single object (multi-point).
The multi-object types are not necessary, but they can be beneficial, as the above example shows.
2.21
1. An alternate shape can be used to model roads, because a road can be a line string or a polygon.
2. An alternate shape can be used to model roads, because a road can be a collection of points or a
collection of polygons.
3. One entity, RoadNetwork, can be added. It has the type collection-of-line-strings and
collection-of-polygons and is connected to the entity Roads by the relationship Part_of.
4. The Manager entity can be modeled with a point shape.
2.22
I looked at the TIGER files and tried to glean as much information as possible from the documentation and
the data dictionary that describes all of the record types included in TIGER/Line files. It appears to me that
the schema would resemble a star schema like that of data warehouses. It seems to be made up of "county"
data, which encapsulates addresses, zip codes, geographical features, landmarks, etc. So the diagram would
have the county fact table in the middle, with all of the other supporting tables linked into that one central
data repository.
2.23
Answer 1: The ER diagram as presented in Figures 2.4 and 2.7 shows that several fire stations can exist,
but each monitors only one forest (a many-one relationship). To allow a fire-station instance to monitor
more than one forest, the relationship would need to be changed from many-one to many-many. In the
relational schema, the forest-name key would be dropped and replaced with an attribute that notes the
forest names a given station might service; that is, allow a single station to respond to multiple forests.
However, it might be preferable to drop the forest-name attribute altogether and add an attribute that
specifies the maximum distance within which a fire station will respond to a fire. This approach is
preferable because it explicitly considers the spatial relationship between the station's location and the
location of the fire. Once a fire is sighted, a simple nearest-neighbor query could identify the stations that
should respond.
Answer 2: In Figure 2.4, change the monitors relationship between Forest and Fire-Station from 1:M to
M:N (Forest —M— monitors —N— Fire-Station).
For Figure 2.5, we need to create a relationship relation 'Monitors' between the Fire-Station and Forest
relations, and then remove the foreign key ForName from Fire-Station:
Monitors(FStaName varchar, ForName varchar)
In Figure 2.7, the monitors association between Forest and Fire-Station likewise receives M:N multiplicity.
2.24 Left and right sides of an object are inverted in mirror images, however top and bottom are not.
Explain using absolute and relative diagrams.
This question deals with the directional relationship between an object, a viewer, and a common coordinate
system. The viewer sees the left/right switch when looking directly at the object versus looking at the
object through the mirror. This switch is seen because the object’s (relative) left and right actually do
switch when the object is rotated into the mirror. The top and bottom of the object stays constant because
they are common to both the viewer and the object. In some ways the object represents a second viewer
looking back at the observer (both maintain their own relative left and right and share a common up and
down).
Chapter 3
3.1 Express the following queries in relational algebra.
a) Find all countries whose GDP is greater than $500 billion but less than $1 trillion.
R ← π_Name(σ_(GDP > 500 AND GDP < 1000)(Country))
b) List the life expectancy in countries that have rivers originating in them.
1. R ← π_(Name, Life-Exp)(Country)
2. S ← π_Origin(River)
3. π_(Life-Exp)(R ⋈_(Name = Origin) S)
c) Find all the cities that are either in South America or whose population is less than two million.
1. R ← π_(City.Name)(σ_(Cont = 'SAM')(City ⋈_(City.Country = Country.Name) Country))
2. S ← π_Name(σ_(Pop < 2)(City))
3. R ∪ S
d) List all the cities which are not in South America.
1. R ← π_Name(City)
2. S ← π_(City.Name)(σ_(Cont = 'SAM')(City ⋈_(City.Country = Country.Name) Country))
3. R − S
3.2
(a)
SELECT C.Name
FROM Country C
WHERE C.GDP > 500.0 AND C.GDP < 1000.0;
(b)
SELECT C.Name, C.Life-Exp
FROM Country C, River R
WHERE C.Name = R.Origin;
(c)
SELECT Ci.Name
FROM City Ci, Country Co
WHERE Ci.Country = Co.Name
AND (Ci.Pop < 2.0 OR Co.Continent = 'SAM');
(d)
SELECT Ci.Name
FROM City Ci, Country Co
WHERE Ci.Country = Co.Name
AND Co.Continent <> 'SAM';
3.3 Express in SQL the queries listed below
A. Count the number of countries whose population is less than 100 million.
SELECT COUNT(*)
FROM Country Co
WHERE Co.Pop < 100;
B. Find the country in North America with the smallest GDP (use a nested query).
SELECT Co.Name
FROM Country Co
WHERE Co.Cont = 'NAM'
AND Co.GDP <= ALL
(SELECT Cou.GDP
FROM Country Cou
WHERE Cou.Cont = 'NAM');
C. List all countries that are in North America or whose capital cities have a population of less than 5
million.
SELECT DISTINCT C.Name
FROM Country C LEFT JOIN City CI ON (CI.Country = C.Name)
WHERE C.Cont = 'NAM'
OR (CI.Capital = 'Y' AND CI.Pop < 5);
D. Find the country with the second highest GDP.
SELECT Name
FROM Country
WHERE GDP >= ALL
(
SELECT GDP
FROM Country
WHERE GDP < (SELECT MAX(GDP)
FROM Country)
)
AND GDP < (SELECT MAX(GDP)
FROM Country);
3.4 (pp. 76) Reclassify is an aggregate function that combines spatial geometries on the basis of
nonspatial attributes. It creates new objects from existing ones, generally by removing the internal
boundaries of adjacent polygons whose chosen attribute is the same. Can we express the Reclassify
operation using OGIS operations and SQL92 with spatial data types? Explain.
We can express the Reclassify operation using the OGIS union and touch operations. We use the touch
operation to test whether two polygons are adjacent, and the union operation to remove the internal
boundaries, based on a test of the nonspatial attributes. For example, to reclassify the map of countries on
the basis of the majority religion practiced in each country: the touch operation tests whether two countries
are adjacent; if so, SQL92 predicates test whether the two countries have the same majority religion; if so,
the boundary between the neighboring countries is removed with the union operation.
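A minimal Java sketch of this procedure, assuming the open-source JTS topology suite stands in for the OGIS operations (the Region record, the attribute name, and the greedy merge loop are illustrative assumptions):

import org.locationtech.jts.geom.Geometry;
import java.util.ArrayList;
import java.util.List;

public class Reclassify {
    // A polygon tagged with the nonspatial attribute we reclassify on.
    record Region(Geometry shape, String majorityReligion) {}

    // Greedily merge touching regions that share the same attribute value.
    static List<Region> reclassify(List<Region> input) {
        List<Region> out = new ArrayList<>(input);
        boolean merged = true;
        while (merged) {
            merged = false;
            outer:
            for (int i = 0; i < out.size(); i++) {
                for (int j = i + 1; j < out.size(); j++) {
                    Region a = out.get(i), b = out.get(j);
                    if (a.majorityReligion().equals(b.majorityReligion())
                            && a.shape().touches(b.shape())) {   // OGIS touch test
                        // OGIS union removes the shared internal boundary.
                        out.set(i, new Region(a.shape().union(b.shape()),
                                a.majorityReligion()));
                        out.remove(j);
                        merged = true;
                        break outer;
                    }
                }
            }
        }
        return out;
    }
}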
3.5: Discuss the geometry data model of Figure 2.2. Given that on a "world" scale cities are
represented as point data types, what data type should be used to represent the countries of the
world? Note: Singapore, the Vatican, and Monaco are countries. What are the implementation
implications for the spatial functions recommended by the OGIS standard?
The OGIS geometry model for representing spatial entities in spatial databases provides a general
Geometry shape defined on a particular spatial reference system. This general Geometry shape breaks
down into four basic shapes – zero-dimensional points, one-dimensional curves, two-dimensional surfaces,
and geometry collection. The last shape is essentially a collection of the other three basic shapes that
provides closure. Based on Figure 2.2, we see that points make up line strings (which approximate curves),
and line strings make up surfaces.
On a world scale, points will represent cities and surfaces will represent countries,
no matter how small the countries are. In a spatial database, we should store countries, even small ones like
the Vatican City, San Marino, and Monaco, as surfaces so that we can nest cities (e.g., Monte Carlo in
Monaco) within the surfaces. This nesting allows us to execute point-in-polygon queries on the database.
Also, we should store all countries as the same data type – storing some countries as points and some as
polygons makes it difficult to perform spatial analysis. Maintaining entities, like countries, in the same data
type should ease the implementation headaches of the OGIS spatial functions.
Now, if we want to make a world map of countries from this spatial database, the small size of some
countries will make it difficult to render them on a regular sheet of paper. Thus, it may make sense to
convert small countries to points for the map. Ultimately, within the spatial database, we should maintain
countries as surfaces for analytical and topological processing. Moving from the spatial database to a map,
however, may require transformations among data types.
3.6 [Egenhofer, 1994] proposes a list of requirements for extending SQL for spatial applications. The
requirements are shown below [Refer to table on page 77]. Which of these recommendations have
been accepted in the OGIS SQL standard? Discuss possible reasons for postponing the others.
Spatial abstract data type: implemented.
Graphical presentation: The results returned by the SQL engine are tables, not graphical images. Consider
a query like 'Summarize the distances of the capital cities from the Equator.' SQL is the interface to the
DB, not the interface to the user.
Result combination: The queries are closed; hence their results can be combined. Implemented.
Context: SQL is a mathematically sound language; it cannot just haphazardly include extra information.
Content examination: SQL is not a map-drawing tool. Graphical user interfaces, map-drawing extensions,
and querying by mouse are to be implemented by middleware.
Selection by pointing: The above applies.
Legend: SQL's responsibility is to retrieve data from the database in tabular form. Putting legends onto
the data is not the SQL engine's responsibility, but that of some visualization engine.
Label: You can explicitly store labels in a database if you want to. SQL should not return these labels
unless asked, for reasons explained under 'Context'.
Selection of map scale: Mapping is not the SQL engine's responsibility.
Area of interest: You can restrict your area of interest by using the WHERE clause.
3.7 The OGIS standard includes a set of topological spatial predicates. How should the standard be
extended to include directional predicates such as East, North, North-East, and so forth. Note that
the directional predicates may be fuzzy: "Where does North-East end and East begin?"
By applying a distance operator to a directional neighborhood graph. In this case, predicates such as East,
North, and North-East would determine the distance between two directional relations using shortest
paths; distance would determine where one direction starts and where it ends.
3.8 (pp. 76–78)
Assumptions:
For this assignment, T means that the relation will hold for all spatial types (e.g., in the case of TOUCH we
have T on Bnd(A) ∩ Bnd(B); the intersection of the two boundaries is a point or a line, but that is not
described here. T means only that the boundaries of two points, two lines, or two polygons must intersect,
not HOW they intersect).
Secondly, * in a matrix denotes a value we do not care about (the sets may or may not intersect).
Logically, for some intersections we can intuit that the objects MUST or MUST NOT intersect, and thus we
do not need to set these "extraneous" intersections to T or F. For example, in TOUCH we have True on
Bnd(A) ∩ Bnd(B), Int(A) ∩ Ext(B), and Ext(A) ∩ Ext(B), and we have False on Int(A) ∩ Int(B). These
rules imply that Int(A) ∩ Bnd(B) will be False, and hence we mark it as *, since we did not need to know
this fact in order to model the relation.
The matrices for the general cases are (rows: Int(A), Bnd(A), Ext(A); columns: Int(B), Bnd(B), Ext(B)):
a.) Touches
F * T
* T *
T * *
Crosses
T * T
* * *
* * *
b.) The crosses matrix above also captures the overlap of two lines that run in the same direction but have
different lengths (e.g., a river overlaps a tour-boat route).
c.)
Disjoint
F F *
F F *
* * *
Contains
T * *
* * *
F F *
Inside
T * F
* * F
* * *
Equal
T * F
* T *
F * T
Meet
F * T
* T *
T * *
Covered By
T * F
* T *
* * *
Covers
T * *
* T *
F * *
Overlaps
T * T
* * *
T * T
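These matrices can be computed mechanically. A short Java sketch, assuming the open-source JTS topology suite (whose relate() implements the 9-intersection/DE-9IM model), that prints the full matrix for two polygons and tests a touch pattern:

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKTReader;

public class NineIntersectionDemo {
    public static void main(String[] args) throws Exception {
        WKTReader reader = new WKTReader();
        Geometry a = reader.read("POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))");
        Geometry b = reader.read("POLYGON((2 0, 4 0, 4 2, 2 2, 2 0))");

        // Full 9-intersection matrix as a DE-9IM string: "FF2F11212" here.
        System.out.println(a.relate(b));

        // Pattern test: interiors must not intersect ('F' in position 1),
        // boundaries must intersect ('T' in position 5); '*' is don't-care.
        System.out.println(a.relate(b, "F***T****")); // true: the squares touch
        System.out.println(a.touches(b));             // named predicate, same test
    }
}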
3.9 Express the following queries in SQL, using the OGIS extended data types and functions.
(a) List all cities in the City table which are within five thousand miles of Washington, D.C.
SELECT C1.Name, C1.Country, C1.Pop
FROM City C1, City C2
WHERE C2.Name = 'Washington DC'
AND Distance(C2.Shape, C1.Shape) < 5000;
(b) What is the length of Rio Panamas in Argentina and Brazil?
SELECT R.Name, SUM(Length(Intersection(R.Shape, C.Shape)))
FROM River R, Country C
WHERE R.Name = 'Rio Panamas'
AND Cross(R.Shape, C.Shape) = 1
AND (C.Name = 'Argentina' OR C.Name = 'Brazil')
GROUP BY R.Name;
(c) Do Argentina and Brazil share a border?
SELECT Touch(Co1.Shape, Co2.Shape)
FROM Country Co1, Country Co2
WHERE Co1.Name = 'Argentina' AND Co2.Name = 'Brazil';
(d) List the countries that lie completely south of the equator.
SELECT Co.Name
FROM Country Co
WHERE MaxY(Co.Shape) < 0;
-- MaxY is assumed to return the maximum (northernmost) Y-coordinate of the
-- shape's envelope; a country lies wholly south of the equator when it is negative.
3.10 (p. 78)
a) Names of all rivers that cross Itasca State Forest:
SELECT river.name
FROM river, forest
WHERE forest.name = "Itasca"
AND Cross(river.geometry, forest.geometry);
b) Names of all tar roads that intersect Francis Forest:
SELECT road.name
FROM road, forest
WHERE road.type = "tar"
AND forest.name = "Francis"
AND Intersect(road.geometry, forest.geometry);
c) Names of all roads with stretches within the floodplain of the river Montana:
SELECT road.name
FROM road, river
WHERE river.name = "Montana"
AND Intersect(road.geometry, river.flood-plain);
d) IDs of land parcels within 2 miles of Red River or 5 miles of Big Tree S.P.:
SELECT land-parcel.id
FROM land-parcel, river, forest
WHERE (Intersect(land-parcel.geometry, Buffer(river.geometry, 2))
AND river.name = "Red")
OR (Intersect(land-parcel.geometry, Buffer(forest.geometry, 5))
AND forest.name = "Big Tree");
e) Names of rivers that define part of the boundary of a county – not possible given the current schema.
If a county table were present:
SELECT UNIQUE river.name
FROM river, county
WHERE Touch(river.geometry, county.geometry);
The county table could be built as a view from the land-parcel table if Union() took many arguments, but
the table on page 66 defines it as taking only two.
3.11 Study compiler tools such as YACC (Yet Another Compiler Compiler). Develop a syntax
scheme to generate SQL3 data definition statements from an entity relationship diagram annotated
with pictograms.
The Entity Relationship Diagram (ERD) with pictograms provides a simple and clear way to model spatial
and non-spatial data relationships. Pictograms can convey the spatial data types, the scale, and implicit
relationships of spatial entities. These implicit relationships are topological, directional, and metric
relationships.
Pictograms are graphical representations of data-type objects; typically these are abstract data types. The
objects can be basic shapes, multi-shapes, derived shapes, alternative shapes, and potentially a user-defined
object. All objects are described by their unique graphic symbol. There is a set of rules that allow
pictograms to convey information about multishapes, derived shapes, and alternative shapes (i.e., shape
representation and behavior changes with scale).
In an ERD that uses pictograms, it is not necessary to explicitly depict the relationships between spatial
objects. Rather these relationships are object specific and therefore implicit.
SQL3 allows the user to define abstract data types much as classes are defined in many object-oriented
languages (e.g., C++, Python, or Java). Specifically, there is a data type name (i.e., class name) followed
by member data types, then member functions. These member functions are used to reveal, alter, or
perform operations on, or with, other objects. SQL3 provides provisions for a form of inheritance; in this
way a basic shape, like a point, can be used to construct more complex shapes such as lines or polygons.
An ERD with embedded pictograms, and SQL3, provide complementary tools to conceptualize
objects/object relationship and manipulate these data types in a spatial database. The challenge is to
develop a set of rules (i.e., grammar) to adequately transfer the pictogram representation of objects into
valid SQL3 object declarations.
Yet Another Compiler Compiler (YACC) is a compiler-generation tool that allows users to define a
language (i.e., a set of rules) to structure program input. In addition to defining the grammar elements of
their language, users can also specify functions that are invoked when certain rules are satisfied (e.g., a
function is called when a token word or character is recognized in the program input). A set of rules that
can be interpreted by a program such as YACC can bridge the gap between the graphical representation
of objects in the pictogram and the SQL3 syntax.
For instance a set of rules could be written that recognizes the pictogram point specifications:
input parameters: <pictogram><shape><basic shape><point><cardinality argument>;
The set of user specified grammar rules and associated data members and member functions (e.g.,
distance()), could be used to translate the input parameters into SQL3 syntax. Example output in SQL3
form:
CREATE TYPE Point (x NUMBER, y NUMBER,
FUNCTION distance(:u Point, :v Point) RETURNS NUMBER);
In this manner, YACC-type rules could be defined for all OGIS data types and user-defined data types.
3.12 How would one model the following spatial relationship using 9-intersection model or OGIS
topological operations?
(a) A river (LineString) originates in a country (Polygon)
SELECT R.Name
FROM River R, Country C
WHERE CONTAINS(C.Shape, R.Origin) = 1;
(b) A country is completely surrounded by another country
SELECT C1.Name
FROM Country C1, Country C2
WHERE WITHIN(C1.Shape, C2.Shape) = 1;
(c) A river falls into another river
SELECT R1.Name, R2.Name
FROM River R1, River R2
WHERE INTERSECT(R1.Shape, R2.Shape) = 1;
(d) Forest stands partition a forest
SELECT F.Name
FROM Forest F, Forest_Stand FS
WHERE CROSS(FS.Shape, F.Shape) = 1;
3.13. Review the example RA queries provided for state park database in Appendix. Write SQL
expressions for each RA query.
Example Query: Find the names of the StatePark which contains the Lake with Lid number 100
CREATE VIEW Lake100 AS
SELECT Pl.Sid
FROM ParkLake Pl
WHERE Pl.Lid = ‘100’
SELECT Sp.Sname
FROM Lake100, StatePark Sp
WHERE Lake100.Sid = Sp.Sid
1) Query: Find the names of the StateParks with Lakes where the MainCatch is Trout.
CREATE VIEW TroutLake AS
SELECT La.Lid
FROM Lake La
WHERE Main-Catch = ‘Trout’
SELECT Sp.Sname
FROM TroutLake, ParkLake Pl, StatePark Sp
WHERE Pl.Lid = TroutLake.Lid AND
Sp.Sid = Pl.Sid
2) Query: Find the Main-Catch of the lakes that are in Itasca State Park
CREATE VIEW ItascaParkId AS
SELECT Sp.Sid
FROM StatePark Sp
WHERE Sp.Sname = 'Itasca'
SELECT La.Main-Catch
FROM ItascaParkId, ParkLake Pl, Lake La
WHERE ItascaParkId.Sid = Pl.Sid AND
La.Lid = Pl.Lid
3.18 (p. 79) Revisit the relational schema for the state-park example in Section 2.2.3. Outline SQL
DDL statements to create the relevant tables using OGIS spatial data types.
CREATE TABLE Forest-Stand (
Stand-id Integer,
Species varchar(30),
Forest-name varchar(30),
Shape Polygon,
PRIMARY KEY (Stand-id));

CREATE TABLE River (
Name varchar(30),
Length Real,
Shape LineString,
PRIMARY KEY (Name));

CREATE TABLE Road (
Name varchar(30),
NumofLanes Integer,
Shape LineString,
PRIMARY KEY (Name));

CREATE TABLE Facility (
Name varchar(30),
Forest-name varchar(30),
Forest-name-2 varchar(30),
Shape Point,
PRIMARY KEY (Name));

CREATE TABLE Forest (
Name varchar(30),
Shape Polygon,
PRIMARY KEY (Name));

CREATE TABLE Fire-Station (
Name varchar(30),
Forest-name varchar(30),
Shape Point,
PRIMARY KEY (Name));

CREATE TABLE Supplies_Water_To (
FacName varchar(30),
RivName varchar(30),
Volume Real,
PRIMARY KEY (FacName, RivName));

CREATE TABLE Road-Access-Forest (
RoadName varchar(30),
ForName varchar(30),
PRIMARY KEY (RoadName, ForName));

CREATE TABLE Manager (
Name varchar(30),
Age Integer,
Gender varchar(30),
ForName varchar(30),
PRIMARY KEY (Name, Age, Gender));
3.19. Consider shape-based queries, for example, list countries shaped like a lady's boot or list
squarish census blocks. Propose extensions to SQL3/OGIS to support such queries.
Shape-based queries, which belong to the family of content-based retrieval, have not yet been modeled by
the OGIS data model. This is clearly a nontrivial task and is a topic of intense current research.
Intuitively, for shape-based queries, suppose we already have the target shape clearly defined (such as the
"squarish census block" or the "lady's boot" in this case); we can then simply compare the query shape with
the target shape. However, we cannot do this directly using OGIS operators such as Overlay or Difference,
for two reasons: 1) the query shape and the target shape may not be the same size; and 2) even if they are
the same size or the same shape, they may have different orientations. So before we compare the query
shape and the target shape, we first need translation, rotation, and rescaling operations.
A simple idea is as follows:
1) Find the long axis of both the query shape and the target shape. If the orientations of the two long axes
differ, first rotate the query shape so that its long axis matches that of the target shape.
2) After the rotation, resize the query shape to approximately the same size as the target shape.
3) Once the query shape and the target shape have the same size and orientation, use the OGIS operator
Difference to find the portion of the query shape's geometry that does not match the target shape.
4) Define a threshold value and decide whether the difference between the two shapes is small enough to
say they have the same shape.
Current studies on Shape-Based Retrieval have invented various algorithms for shape description and
comparison. Fourier descriptors (FD), curvature scale space (CSS) descriptors (CSSD), Zernike moment
descriptors (ZMD), and grid descriptors (GD) are some of the examples. For example, Fourier descriptors
are obtained by applying Fourier transform on shape boundaries. Both the Query Shape and the Target
Shape are represented using a set of coordinates. Then the centroid distance function is applied so that the
centroid distance representation is invariant to shape translation. The centroid distance function is as
follows: r_i = sqrt((x_i - x_c)^2 + (y_i - y_c)^2), i = 1, 2, …, L, where x_c and y_c are the averages of the
x and y coordinates, respectively. In order to apply the Fourier transform, all the shapes in the database are normalized to
the same number of boundary points. Then discrete Fourier transform is applied. The coefficients
calculated from the Fourier transformation are called Fourier descriptors. The FDs acquired in this way are
translation invariant due to the translation invariance of the centroid distance. To achieve rotation
invariance, the phase information of the FDs is ignored and only the magnitudes |FDn| are used. Scale
invariance is achieved by dividing the magnitudes by |FD0|. The similarity measure between the query
shape and a target shape in the database is simply the Euclidean distance between their feature vectors.
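A small Java sketch of the Fourier-descriptor pipeline just described (the naive O(L^2) DFT and the method names are illustrative simplifications):

public class FourierDescriptor {
    // Centroid-distance signature: r_i = sqrt((x_i - x_c)^2 + (y_i - y_c)^2),
    // which is invariant to translation of the shape.
    static double[] centroidDistance(double[] xs, double[] ys) {
        int n = xs.length;
        double xc = 0, yc = 0;
        for (int i = 0; i < n; i++) { xc += xs[i]; yc += ys[i]; }
        xc /= n; yc /= n;
        double[] r = new double[n];
        for (int i = 0; i < n; i++)
            r[i] = Math.hypot(xs[i] - xc, ys[i] - yc);
        return r;
    }

    // Magnitudes of the discrete Fourier transform of the signature,
    // normalized by |FD0|. Dropping phase gives rotation invariance;
    // dividing by |FD0| gives scale invariance, as described above.
    static double[] descriptors(double[] r) {
        int n = r.length;
        double[] mag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int i = 0; i < n; i++) {
                double ang = -2 * Math.PI * k * i / n;
                re += r[i] * Math.cos(ang);
                im += r[i] * Math.sin(ang);
            }
            mag[k] = Math.hypot(re, im);
        }
        for (int k = n - 1; k >= 0; k--) mag[k] /= mag[0]; // normalize by |FD0| last
        return mag;
    }

    // Similarity measure: Euclidean distance between descriptor vectors.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}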
Chapter 4
4.3
a) The R-tree duplicates pointers to large objects across multiple nodes of the index; a search goes down
multiple paths in the index and requires duplicate-elimination postprocessing. The grid file, in contrast,
decomposes large objects into smaller fragments aligned with its grid cells, which leads to additional
postprocessing to merge the fragments and reconstruct each object.
b) The strength of the scheme is that it is more convenient for handling large spatial objects, and it can
speed up search of a data file by indexing each group independently. The weakness is that as the size of
the spatial objects grows, search time slows down.
4.4 Compare and contrast the following terms:
a) Clustering vs. Indexing
Clustering and indexing are both used to speed up DBMS data-retrieval performance; they differ in how
they accomplish the improvement. Clustering attempts to organize the data on disk so that it is easy for
the DBMS to locate geographically close objects on disk. Indexing does not concern itself with how the
records are stored on disk; rather, it maps key fields to disk pages, so for every distinct key in the database
the DBMS can quickly look up the key field in the sorted index file and immediately know where on disk
the object resides.
b) R-Trees vs. Grid File
Grid files are simple fixed grids laid over a space, dividing it into grid cells. A more advanced variant is
the non-uniform grid, in which, as the name suggests, the cells can have different sizes. Grid files rely on
a grid directory, which is simply a place to keep track of the bucket in which each data entry is located.
Grid files are efficient in I/O cost but require a large main memory for the grid directory. An R-tree
represents spatial data in the manner of a B-tree. Like B-trees, there are different kinds of R-trees
(R+-trees, for example). In an R-tree, each spatial item is approximated by a rectangle. The rectangles are
then subdivided in the same manner as data is for B-trees, except that the rectangles are allowed to
overlap. Overlapping rectangles are dealt with in different ways depending on the type of R-tree used.
In an R-tree, each rectangle can have only one parent node.
c) Hilbert Curve vs. Z-Curve
These are two clustering methods used in spatial databases to try to keep geographically close objects
close to each other in the database. The Z-curve is named for the 'Z' pattern traced when you access
sequential objects from the database. An object's Z-value is determined by interleaving the bits of its x
and y coordinates and then translating the resulting number into a decimal value; this value is the object's
position along the curve. Hilbert curves accomplish the same idea, but the method is slightly different: the
Hilbert value is derived from the interleaved bits by an additional sequence of quadrant rotations (see the
Java sketch after this list). With Z-curves there are instances where we move diagonally between cells
that are not adjacent; Hilbert curves avoid this inefficiency because they contain no diagonal moves. It is
still possible for objects that are geographically adjacent not to be adjacent along the Hilbert curve, and
the Hilbert value costs more to compute than the Z-value.
d) Primary Index vs. Secondary Index
Both of these indices are two-field files that specify on which disk page a particular record resides. The
difference is that with a primary index the database table is ordered on disk, so only the first record of
each disk page needs an index entry: when you search the primary index for a certain record, you just
need to find the page it is on, not its exact location, and since the file is ordered, the index can be much
smaller. With a secondary index the file is unordered, so every distinct key must have an index entry,
because any row of the table can be on any disk page.
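A small Java sketch of the Z-value computation described in (c) above (the method name and the 16-bit loop bound are arbitrary choices); it reproduces the interleaving convention used in the answers to 4.8 and 4.20 below:

public class ZValue {
    // Interleave the bits of x and y; x supplies the higher-order bit of
    // each pair, so (x, y) = (010, 010) maps to 001100.
    static int zValue(int x, int y) {
        int z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((x >> i) & 1) << (2 * i + 1); // bit i of x -> bit 2i+1 of z
            z |= ((y >> i) & 1) << (2 * i);     // bit i of y -> bit 2i of z
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(zValue(2, 2)); // 010,010 -> 001100 -> 12
        System.out.println(zValue(5, 4)); // 101,100 -> 110010 -> 50
    }
}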
4.5 (pp. 112) Which of the following properties are true for R-trees, grid files, and B-trees with Z-order?
Balance: R-tree, B-tree with Z-order
Fixed depth: Grid file
Nonoverlapping: Grid file, B-tree with Z-order (R-tree bounding rectangles may overlap)
50 percent utilization per node: B-tree with Z-order
4.6 a) Create a final grid structure for the following data points. Assume each disk page can hold at
most three records. Clearly show the dimension vector, directory grids, and data pages. The data
points are: (0,0), (0,1), (1,1), (1,0), (0,2), (1,2), (2,2), (2,1), (2,0). The utilization rule is 50%.
b) Create an R-tree for the following data points. Assume each intermediate node may point to
three other nodes or leaves and each leaf may point to three data points. The utilization rule is 50%.
The data points are: (0,0), (0,1), (1,1), (1,0), (0,2), (1,2), (2,2), (2,1), (2,0).
4.7 Draw a final grid file structure generated by inserting records with the following sequence of
point-keys: (0,0) (1,1)(2,2)(3,3)(4,5)(5,5)(5,1)(5,2)(5,3)(5,4)(5,0). Assume each disk page can hold at
most two records.
4.8 Repeat the above with B-tree with Z-order.
The Z-value of each point is obtained by interleaving the bits of its x- and y-coordinates. (The 6x6 grid
diagram showing where pages A–D fall along the Z-curve is not reproduced here.)

Object | Point | x | y | interleave | z-value
A | 1 | 000 | 000 | 000000 | 0
A | 2 | 001 | 001 | 000011 | 3
B | 1 | 010 | 010 | 001100 | 12
B | 2 | 011 | 011 | 001111 | 15
B | 3 | 100 | 101 | 110001 | 49
C | 1 | 101 | 101 | 110011 | 51
C | 2 | 101 | 100 | 110010 | 50
C | 3 | 101 | 011 | 100111 | 39
D | 1 | 101 | 010 | 100110 | 38
D | 2 | 101 | 001 | 100011 | 35
D | 3 | 101 | 000 | 100001 | 33

The B-tree stores the records in ascending z-value order: 0, 3, 12, 15, 33, 35, 38, 39, 49, 50, 51.
4.9 Repeat the above question with R-tree.
The points are arranged as shown below; we can assume that each point represents an MBR.

5 |             *  *
4 |                *
3 |          *     *
2 |       *        *
1 |    *           *
0 | *              *
  +-----------------
    0  1  2  3  4  5

Each index node holds up to three pointers, and each leaf node holds up to two records. Since we have 11
points, we split the area into six leaf regions A–F:
A = {(4,5)}
B = {(5,5), (5,4)}
C = {(2,2), (3,3)}
D = {(5,3), (5,2)}
E = {(0,0), (1,1)}
F = {(5,1), (5,0)}
Here X covers the area of A and B, Y covers the area of C and D, and Z covers the area of E and F.
The R-tree representation is then:
●R
○X: ○A (4,5); ○B (5,5), (5,4)
○Y: ○C (2,2), (3,3); ○D (5,3), (5,2)
○Z: ○E (0,0), (1,1); ○F (5,1), (5,0)
4.10 Consider a set of rectangles whose left-bottom corners are located at the set of points given in
the previous question. Assume that each rectangle has a length of three units and a width of two
units. Draw two possible R-trees. Assume an index node can hold at most three pointers to other
index nodes. A leaf index node can hold pointers to at most two data records.
(Group 7 submitted as hard copy)
4.11 Compute the Hilbert function for a 4x4 grid shown in Fig. 4.9 with the origin changed to
bottom-left corner.
Step 1 (binary cell labels; y rows 11..00, x columns 00..11):
11 | 0101 0111 1101 1111
10 | 0100 0110 1100 1110
01 | 0001 0011 1001 1011
00 | 0000 0010 1000 1010
     00   01   10   11

Step 2:
11 | 11 12 21 22
10 | 10 13 20 23
01 | 01 02 31 32
00 | 00 03 30 33
     00 01 10 11

Step 3:
11 | 11 12 21 22
10 | 10 13 20 23
01 | 03 02 31 30
00 | 00 01 32 33
     00 01 10 11

Step 4 (final Hilbert values):
11 |  5  6  9 10
10 |  4  7  8 11
01 |  3  2 13 12
00 |  0  1 14 15
     00 01 10 11
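A sketch of the standard rotate-and-reflect bit-manipulation formulation of the Hilbert mapping (the
function name is ours); it reproduces the final 4x4 grid above:

def hilbert_xy_to_d(n, x, y):
    # n: grid side length (a power of two); returns the Hilbert index of (x, y).
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant into standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

for y in (3, 2, 1, 0):                            # print the top row first
    print([hilbert_xy_to_d(4, x, y) for x in range(4)])
# [5, 6, 9, 10] / [4, 7, 8, 11] / [3, 2, 13, 12] / [0, 1, 14, 15]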
4.12 What is an R-Link Tree? What is it used for in Spatial Databases?
R-Link trees are a modified R-tree indexing technique. They are the spatial analogue of the B-Link tree,
developed to handle concurrent access to the underlying data.
Basically, the leaf nodes are connected in a linked list. Nodes are assigned logical sequence numbers,
each subsequently assigned number being larger than the previous ones. When a node is split, the old
number goes to the new node, while a new number is assigned to the older node. Effectively, the linked
list runs from left to right, and the numbers encountered should decrease. Should a split occur that has
not yet been posted to the parent level, traversal can reveal it by finding a node whose number is out of
order, since no parent node would currently point to it.
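A minimal sketch (a hypothetical structure, not the book's implementation) of how an out-of-order
sequence number reveals an unposted split during a left-to-right leaf traversal:

from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Leaf:
    lsn: int                         # logical sequence number
    right: Optional["Leaf"] = None   # right-link to the sibling leaf

def unposted_splits(leftmost: Leaf) -> List[int]:
    # LSNs should strictly decrease left to right; a rise signals a split
    # that the parent level has not yet recorded.
    suspects, prev, cur = [], None, leftmost
    while cur is not None:
        if prev is not None and cur.lsn >= prev.lsn:
            suspects.append(cur.lsn)
        prev, cur = cur, cur.right
    return suspects

# A split of the rightmost leaf reassigns its number: 3 -> 2 -> 4 -> 1.
leaves = Leaf(3, Leaf(2, Leaf(4, Leaf(1))))
print(unposted_splits(leaves))   # [4]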
4.13
(Group 9 did not answer this)
4.14 What is a join-index? List a few applications where join-index may be useful
A join-index is a data structure used for processing join queries in databases. A join-index describes a
relationship between the objects of two relations.
Example: Assume that each tuple of a relation has a surrogate (a system-defined identifier for tuples,
pages, etc.) that uniquely identifies that tuple. A join-index is a sequence of pairs of surrogates in which
each pair identifies one result-tuple of the join.
The tuples participating in the join result are given by their surrogates. Let R and S be two relations, and
consider the join of R and S on attribute A of R and attribute B of S.
A formal definition of the join-index can be expressed as:
JI = {(ri, sj) | f(ri.A, sj.B) is true, ri in R and sj in S}
Join-indices use pre-computation techniques to speed up online query processing and are useful for
datasets that are updated infrequently.
Applications where a join-index might be used:
1) In the StatePark example, since information about StatePark and Lake is fairly stable and needs to be
updated infrequently, we can create a join-index between StatePark and Lake and pre-compute the join
result, so that queries over them are processed much faster.
2) To study the relationship between HazardousSite and the Species in the ForestStand, a join-index can
also be created between ForestStand and BufferOfHazardousSite.
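A small sketch (hypothetical tables) of precomputing a join-index as pairs of surrogates that satisfy the
join predicate:

# Surrogates -> tuples; each lake references the park that contains it.
state_parks = {1: "Itasca", 2: "Gooseberry"}
lakes = {10: ("Lake A", 1), 11: ("Lake B", 1), 12: ("Lake C", 2)}

# Precompute once: (park surrogate, lake surrogate) pairs where inside() holds.
join_index = [(park_id, lake_id)
              for lake_id, (_, park_id) in lakes.items()
              if park_id in state_parks]

def lakes_in_park(park_id):
    # At query time the join is an index lookup, not a scan of both relations.
    return [lakes[l][0] for p, l in join_index if p == park_id]

print(lakes_in_park(1))   # ['Lake A', 'Lake B']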
4.17
a. Nodes b and c
b. Nodes a, c, d, e, f
4.18
a.) For a point query, we would need to pull rectangle 5 from nodes b and c in intermediate node x.
b.) The range query performs the same as in 4.17: nodes a, c, and d within intermediate node x, and
nodes e and f in intermediate node y.
4.20. Compute Hilbert values for the pixels in object A, B, and C in Figure 4.7. Compare the
computational cost of range searches for objects B and C using Hilbert curve ordered storage.
x    y    interleaved  decimal  updated decimal  updated binary  Hilbert value
00   00   0000         00       00               0000             0
01   00   0010         03       01               0001             1
10   00   1000         30       32               1110            14
11   00   1010         33       33               1111            15
00   01   0001         01       03               0011             3
01   01   0011         02       02               0010             2
10   01   1001         31       31               1101            13
11   01   1011         32       30               1100            12
00   10   0100         10       10               0100             4
01   10   0110         13       13               0111             7
10   10   1100         20       20               1000             8
11   10   1110         23       23               1011            11
00   11   0101         11       11               0101             5
01   11   0111         12       12               0110             6
10   11   1101         21       21               1001             9
11   11   1111         22       22               1010            10
Assuming that the query is to retrieve all objects in the regions of A, B and C, and furthermore assuming
that the objects are stored in a B+-tree ordered in Hilbert-order with 16 leaves and an effective fanout of f,
the strategy of executing the query is as follows.
1. Convert the regions into Hilbert values.
A={5},
B={8, 9, 10, 11},
C={1,14}.
2. Find maximal contiguous runs of Hilbert values. Within a run it is enough to follow the links
between the leaves; for each separate run we must descend the tree again from the root.
A = {{5}}, B = {{8, 9, 10, 11}}, C = {{1}, {14}}
3. Accordingly, our method of retrieving the objects is to descend from the root to 1 and retrieve all
objects from 1, then descend from the root to 5 and retrieve all objects from 5. Next, from the root,
descend to 8, retrieve all objects from 8, then follow the leaf links to 9, 10, and 11, retrieving all
objects from them. Finally, from the root, descend to 14 and retrieve all objects from 14.
Figure 1. Left: the Hilbert values for Figure 4.7 in the SDB book. Right: the range queries; the red
arrows indicate the leaves to which the tree must be searched, while the other leaves can be accessed by
following the links between adjacent nodes in the B+-tree.
The cost of this formulation is 4 * log_f(16) + 3 disk pages: four descents from the root plus three
leaf-link traversals.
4.22. This problem is the same as Question 4 on the mid-term exam, so it is skipped here.
Chapter 5
5.3. What is special about query optimization in SDBs relative to traditional relational databases?
The three major differences between spatial and traditional (non-spatial) databases are
1. Lack of fixed set of operators
2. Lack of natural spatial ordering and
3. Evaluation of spatial predicates is more expensive.
The consequences of these differences are that (a) the system does not know the cost of user-defined
operators, (b) due to the lack of natural ordering new access methods are needed and (c) the premise that
I/O cost is the dominant cost (relative to the CPU cost) may not hold.
Optimizers worked around these problems by (i) introducing the filter-and-refine paradigm and (ii)
developing new access methods.
The filter-and-refine paradigm. Instead of handling the potentially complex objects directly, the SDB
engine stores and processes the minimum bounding rectangles (MBRs) of these objects. In the filter step,
which is performed by the engine itself, only the MBRs are considered, with a restricted set of operators
[overlap, join, NN]; the evaluation of the actual predicate on the complex objects themselves is
performed in the refine step, a responsibility usually delegated back to the application. This paradigm
solves (a) by using only a limited set of spatial operators [usually overlap], for which the cost is known
and reasonably low when applied to MBRs. This also solves (c).
New Spatial Access Methods. (i) Either an ordering is forced onto the multi-dimensional space [such as
Hilbert curve] or (ii) inherently multi-dimensional indices are used [e.g. R-tree].
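A minimal filter-and-refine sketch (hypothetical names; the exact predicate is passed in) using MBR
overlap as the filter:

from dataclasses import dataclass

@dataclass
class MBR:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, o):
        return (self.xmin <= o.xmax and o.xmin <= self.xmax and
                self.ymin <= o.ymax and o.ymin <= self.ymax)

def filter_and_refine(query_mbr, objects, exact_test):
    # objects: iterable of (mbr, geometry) pairs.
    # Filter step: cheap MBR-overlap test, done inside the engine.
    candidates = [(m, g) for m, g in objects if query_mbr.overlaps(m)]
    # Refine step: the expensive exact geometric predicate, on survivors only.
    return [g for m, g in candidates if exact_test(g)]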
5.4
a. With no R-tree indices available, we are forced to use the brute-force nested-loop join strategy, which
compares all possible joined tuples.
b. When one relation has an index, we can use the nested loop with index. By using the relation with the
index as the inner loop of the nested loop, we do not need to scan the whole relation on each iteration;
we can reduce it to a range scan of the index.
c. If both relations have an index, we can take advantage of the tree-matching strategy. By using the
R-tree indices for each relation, we can achieve a lower bound of I/O cost = I_R1 + I_R2, where I_Rn is
the number of pages of the index for relation Rn.
5.5 Which join strategies can be used for spatial join, with join predicates other than overlap?
- Nested loop: check every possible pair.
- Tree matching: use R-tree indices on both relations (if available).
- Space partitioning: objects are compared only if they share a region.
5.6 Tree transformation: draw query trees for the following queries:
a) Which city in the City table is closest to a river in the River table (Section 3.5)? Give the SQL
equivalent.
b) List countries whose GDP is greater than that of Canada.
c) List countries that have at least as many lakes/neighbors as France.
5.7 Apply the tree transformation rules to generate a few equivalent query trees for each query tree
generated above [in problem 5.6]. (Group 3 did not answer this question.)
5.8 Consider a composite query with a spatial join and two spatial selections. However, this strategy
does not allow the spatial join. Take advantage of the original indexes on lakes and facilities. In
addition, does the writing and reading of file 1 and file 2 add to the costs? Devise an alternate strategy
to simultaneously process the join and the two selections underneath.
The query has two spatial selections and one spatial join, and both of the files have indexes. If spatial
indexes are available on both relations, the tree-matching strategy can be used. If I_R1 and I_R2 are the
number of pages occupied by the nodes of the index trees for relations R and S respectively, the
minimum I/O cost is I_R1 + I_R2.
The total cost is the I/O cost plus the join cost (a small worked example follows):
I_R1 + I_R2 + (js * |R| * |S|) / bfr_RS
where
js     = join selectivity
|R|    = number of records in R
|S|    = number of records in S
bfr_RS = blocking factor for the join
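For illustration only, under assumed values:

# Assumed values: join selectivity, record counts, index pages, blocking factor.
js, R, S = 0.001, 10_000, 20_000
I_R1, I_R2, bfr_RS = 120, 150, 50
cost = I_R1 + I_R2 + (js * R * S) / bfr_RS
print(cost)   # 120 + 150 + 200000/50 = 4270.0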
5.9 We focused on strategies for point-selection, range-query selection, and spatial join for overlap
predicates in this chapter. Are these adequate to support processing of spatial queries in
SQL3/OGIS?
They provide the basics, but as mentioned in Chapter 3, operations are limited to simple
select-project-join queries. Support for directional predicates is missing, spatial aggregate queries pose
problems, and support for shape-based and visibility-based operations is also needed.
5.10 How can we process various topological predicates (e.g. inside, outside, touch, cross) in terms of
the overlap predicate as a filter, followed by an exact geometric operation?
For this, we can use MBRs, since they provide an affordable approximation of the spatial region
contained within a geometry. By applying the overlap operation to the MBRs, and judging by Figure 2.3
on page 30, we can see that a failed overlap test eliminates both disjoint and meet from consideration.
This saves time by removing clearly disjoint objects before advancing to the exact geometry test for the
remaining six predicates.
5.11 Design an efficient single-pass algorithm for nearest neighbor query using one of the following
storage methods: Grid file, B+ tree with Z-order.
With a grid file, the search first locates the bucket of the cell containing the query point; grid-directory
entries for nearby cells may share the same bucket, so records belonging to these regions are stored
together. The search then expands outward cell by cell, using pruning strategies: a cell is skipped when
its minimum distance to the query point already exceeds the best distance found so far.
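A sketch of such a single-pass search over a uniform grid (a simplification of a grid file; the cell size and
names are our assumptions, and at least one point is assumed to be stored):

import math
from collections import defaultdict

def build_grid(points, cell):
    grid = defaultdict(list)
    for p in points:
        grid[(int(p[0] // cell), int(p[1] // cell))].append(p)
    return grid

def nearest(grid, cell, q):
    qc = (int(q[0] // cell), int(q[1] // cell))
    best, best_d, r = None, math.inf, 0
    while True:
        # Examine the ring of cells at Chebyshev distance r from the query cell.
        for cx in range(qc[0] - r, qc[0] + r + 1):
            for cy in range(qc[1] - r, qc[1] + r + 1):
                if max(abs(cx - qc[0]), abs(cy - qc[1])) != r:
                    continue
                for p in grid.get((cx, cy), []):
                    d = math.dist(p, q)
                    if d < best_d:
                        best, best_d = p, d
        # Prune: every unexamined cell lies at least r * cell away from q.
        if best is not None and best_d <= r * cell:
            return best
        r += 1

pts = [(0, 0), (1, 1), (6, 6), (7, 7)]
g = build_grid(pts, 2.0)
print(nearest(g, 2.0, (5.5, 5.5)))   # (6, 6)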
5.12
(Group 8 did not answer this)
5.13 Compare the computational cost of the two-pass and one-pass algorithms for nearest neighbor
queries. When is the one-pass preferred?
The cost of the two-pass algorithm is the cost of a point search plus the cost of a range query. The cost
of a point search is O(log_B n), where B is the blocking factor, and the cost of the range query is
O(log_B n) + (query size / clustering efficiency). Therefore the entire cost is approximately
2 * log_B n + (query size / clustering efficiency). The cost of the one-pass algorithm is the cost of a tree
traversal, O(log n). One pass is preferable when the blocking factor is high and the data set is
sufficiently sparse.
5.14 Compare and contrast
(i) parallel Vs. distributed databases.
Parallel database management systems utilize parallel processor technology employing one of three main
architectures, namely (a) shared memory, (b) shared disk, and (c) shared nothing. A distributed database
is a collection of multiple, logically interrelated databases distributed
over a network. Here, the architecture could be a client-server system or a collaborating system. Although
the shared-nothing architecture in parallel databases resembles a distributed database system, there are
differences in the mode of operation: in parallel systems there is symmetry and homogeneity of nodes,
whereas in distributed systems heterogeneity of hardware and operating systems at each node is common.
(ii) declustering Vs dynamic load balancing
Declustering refers to dividing the set of data items in query processing among various disks such that
response time is minimized. Depending on when data is divided and allocated, declustering methods fall
into two categories: static load balancing, where partitioning is done before computation, and dynamic
load balancing, where data is partitioned at run time. Most often, both static declustering and dynamic
load balancing are required because of highly non-uniform data distributions and the large variation in
the size and extent of spatial data.
(iii) shared nothing Vs shared disk architectures
In shared nothing (SN) architecture, each processor is associated with a memory and some disk units that
can only be accessed by that processor. In shared disk (SD) architecture, each processor has a private
memory that can be accessed only by that processor, but all processors in the system can access all the
disks in the system. The SN architecture minimizes interference among different processors and hence
improves scalability, but load balancing is a greater challenge than in the other two architectures; also,
the data on a disk becomes unavailable if its processor fails. The SD architecture is halfway between SN
and shared memory (SM) as far as resource sharing goes; hence it is more scalable than SM, and load
balancing is easier than in SN. Communication overhead is less than in SN and synchronization is easier.
(iv) client server Vs collaborative systems
In a client server system, there are one or more client processes and one or more server processes; a client
process can send a query to any server process. Clients are responsible for user interface and server
manages the data and executes transactions. A collaborative system has a collection of servers and each of
these can run transactions on local data; they cooperatively execute transactions that span various servers.
Client server system can be easily implemented; it provides for a better utilization of expensive servers.
Each module can be optimized thus reducing the volume of transferred data. If the functionality of server
needs to overlap that of client, a client server system would need more complex clients whereas this is
easily handled in collaborative systems since there is no distinction between the two.
5.15. What is GML? Why is it interesting to spatial databases and GIS?
GML is the Geography Markup Language: a modeling language for geographic information and an
encoding for geographic information. It is designed for the web and web-based services, and it is an
OpenGIS implementation specification. GML is based on XML technologies and implements concepts
of the ISO 19100 series. It supports spatial and non-spatial properties of objects, and it is extensible,
open, and vendor-neutral.
Characteristics of GML:
1) It supports the description of geospatial application schemas for information communities
2) enables the creation and maintenance of linked geographic application schemas and datasets
3) supports the transport and storage of application schemas and data sets
4) increases the ability of organizations to share geographic application schemas and the information
they describe
5) leaves it to implementers to decide whether application schemas and datasets are stored in native
GML or whether GML is used only for schema and data transport
GML is interesting to spatial databases and GIS because it provides support for the geometry elements
corresponding to Point, LineString, LinearRing, Polygon, MultiPoint, MultiLineString, MultiPolygon,
and GeometryCollection. It also provides the coordinates element for encoding coordinates and the Box
element for defining spatial extents. We can use the GML representation to build truly interoperable
distributed GIS.
5.18
Assume that the system response time is the slowest response time of a single processor: if one disk's I/O
time is t and another's is 2t, the system I/O time is 2t.
(1) The linear and CMD methods distribute the 8 data items in one row onto 8 different disks, so the
row-query speed-up is 8 using linear and CMD.
The Z-curve method distributes the 8 data items in one row onto 4 different disks, 2 items per disk, so
the row-query speed-up is 4 using the Z-curve.
The Hilbert method distributes the 8 data items in one row onto 7 or 8 different disks, so the row-query
speed-up is 4 or 8; the I/O speed-up for the lowest row is 4.
(2) The linear and CMD methods distribute the 8 data items in one column onto 8 different disks, so the
column-query speed-up is 8 using linear and CMD.
The Z-curve method distributes the 8 data items in one column onto 2 different disks, 4 items per disk,
so the column-query speed-up is 2 using the Z-curve.
The Hilbert method distributes the 8 data items in one column onto 5 or 6 different disks; sometimes 2
items share a disk, sometimes 3. So the column-query speed-up is 4 or 8/3; the I/O speed-up for the
leftmost column is 4.
(3) The speed-up for a range query depends on the query range. In the bottom 4 rows of the 2 leftmost
columns, there are at most 2 cells on one disk for the linear method, so its speed-up is 4. There are at
most 2 cells on one disk for the CMD method, so its speed-up is 4. There are at most 2 cells on one disk
for the Z-curve method, so its speed-up is 4. There is at most 1 cell on one disk for the Hilbert method,
so its speed-up is 8.
Table 1. I/O speed-up for queries 1, 2, and 3 in problem 5.18

                  Linear   CMD   Z-Curve   Hilbert
1. Row query         8      8       4         4
2. Column query      8      8       2         4
3. Range query       4      4       4         8
Chapter 6
6.1 A unique aspect of spatial graphs is the Euclidian space in which they are embedded, which
provides directional and metric properties. Thus nodes and edges have relationships such as left-of,
right-of, north-of, and so on. An example is the restriction “no left turn” on an intersection in
downtown area. Consider the problem of modeling road maps using the conventional graph model
consisting of nodes and edges. A possible model may designate the road intersections to be the nodes,
with road segments joining the intersections to the edges. Unfortunately, the turn restrictions (e.g.,
No Left Turn) are hard to model in this context. A routing algorithm (e.g., A*, Dijkstra) will consider
all neighbors of a node and would not be able to observe turn restriction. One way to solve this
problem is to add attributes to nodes and edges and modify routing algorithms to pay attention to
those attributes. Another way to solve the problem is to redefine the nodes and edges so that the turn
restrictions modeled within the graph semantic and routing algorithms are not modified.
Propose alternative definitions of nodes and edges so that routing algorithms are not modified.
One way to resolve this issue is to assign very high costs to restricted transitions. For example, if going
from node 1 to node 2 is undesirable (a no-left-turn, say), the cost assigned to 1->2 can be made
prohibitively high to prevent the algorithm from choosing that edge. Another approach is to create a list
of conflicting node pairs; if a pair of nodes is in the conflicting list, the algorithm will not choose it.
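A sketch (hypothetical network) of the second idea, searching over edge states so that a banned turn
between two consecutive edges is never taken:

import heapq

edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "d"): 2, ("d", "b"): 1}
banned_turns = {(("a", "b"), ("b", "c"))}   # no turn from a->b onto b->c

def shortest(src, dst):
    # State = the edge just traversed; successors are outgoing edges of its head.
    pq = [(w, (u, v)) for (u, v), w in edges.items() if u == src]
    heapq.heapify(pq)
    seen = set()
    while pq:
        cost, e = heapq.heappop(pq)
        if e in seen:
            continue
        seen.add(e)
        if e[1] == dst:
            return cost
        for nxt, w in edges.items():
            if nxt[0] == e[1] and (e, nxt) not in banned_turns:
                heapq.heappush(pq, (cost + w, nxt))
    return None

print(shortest("a", "c"))   # 4: the direct a->b->c (cost 2) is ruled out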
6.2 Study internet sites (e.g., mapquest.com) providing geographic services (e.g., routing). List
spatial data types and operations needed to implement these geographic information services.
Answer 1: For this question, I surveyed MapQuest and Yahoo Maps. These services use graph data types
(i.e., graph, node, and edge), which are commonly used for network structures. The services offered by
these sites focus on network-analysis queries, with specific emphasis on shortest-path queries. I
recognized two main variants of the shortest-path query: minimum distance and fastest path. Both might
use the best-first search algorithm (or one modified by edge weights). The minimum-distance query
returns the absolute shortest network distance. The fastest-path query uses a weighted network, whose
weights might be assigned based on either speed limit or an edge (i.e., road) classification scheme (e.g.,
primary four-lane, primary two-lane, secondary, all the way down to jeep trail).
The shortest-path query with only the minimum-distance constraint probably uses a single-pair path
computation, which returns only the route with the minimum distance between the start and stop vertices,
whereas the fastest-path search might select the top k shortest paths and then consider the edge-specific
weights to find the optimal path between the start and stop vertices. This select-and-refine approach is
similar to the minimum-bounding-box approach.
MapQuest also offered a scenic route to the specified destination. This type of query could also fall under
the shortest path; however, the algorithm used to select the scenic route finds a minimized path that touches
the most predefined “scenic” vertices. This query would be performed with constraints that the scenic path
is within a reasonable detour of the shortest or fastest path.
Regardless of the type of query selected, the response is a set of directions that summarizes travel
directions. Compiling these directions in a way that is useful to the user is not a trivial task. The directions
must note: edge name changes, direction changes at vertices that are important to the navigator, and
landmark points which might not be part of the original graph structure. This last operation requires a join
query which might couple a point or polygon layer with the network layer and employ some of standard
OGIS operations.
Answer2:
Internet sites providing geographic services (e.g., mapquest.com) offer the following typical services;
the corresponding data types and operations are listed for each.

Street map, geocoding
  Data types: map data (e.g., shape file)
  Operations: point query; R-tree or grid access; OGIS operations (e.g., SpatialReference(), Boundary())

Direction (routing)
  Data types: spatial network (e.g., vertex, edge, etc.)
  Operations: path query; graph partitioning; graph management operations (e.g., addEdge, getSuccessors)

Proximity searching
  Data types: map data; spatial network
  Operations: range query; nearest neighbor search; R-tree or grid access; distance scan; graph partitioning
6.5 We learned about graph operators. Is this operator relevant to spatial databases with point,
linear, or polygonal objects? Consider the GIS operation of reclassify, for example, given an
elevation map, produce a map of mountainous regions (elevation > 600 m). Note that adjacent
polygons within a mountain should be merged to remove common boundaries.
Graph operators are very relevant to spatial databases. Where it gets difficult in its implementation,
however, is that there are three different data types within these databases: points, lines, and polygons.
Graph operators by definition are given by edges and vertices in which these edges are the connections
between the vertices. Thus, it would seem only logical that graph operators would work well in spatial
databases for certain data types. It almost seems as if these operators were custom designed for the
combination of point and line data, with the points representing the vertices and the lines representing the
edges. A good example of this would be the nodes of a road network, where the edges would be the
different road segments and the points would be important steps along that network, such as an intersection
with another edge. However, trying to apply these operators to more complex polygon data seems to be a
bit more difficult. Because polygons are composed of line segments, which are in turn composed of
points, it might be difficult to employ such operators on a polygon dataset. If one were to feed these
polygon datasets to graph operators as is, one would assume that the polygons would take on the role of
edges while their line segments make up the vertices. But how can one define a polygon solely by the
two vertices it (the edge) falls between? To me, it just does not seem possible to effectively employ
graph operators on polygons as is. However, if we define elevation as being continuous across the
polygon itself, it would be possible to represent the elevation as a single point within the polygon, such
as its geographic center.
We would lose some of the geographic extent of these data with a conversion, but implementing this
change would fit the data better to the graph operators. We could use the vertices to represent these
elevation points while the edges would represent the generalized slopes between these elevation points.
This would in effect create something very similar to what GIS users know as the triangular irregular
network.
I understand that taking this approach to the problem may demonstrate my lack of experience with these
sorts of operators before this class and some of the difficulty I had in grasping some of these more abstract
operators (I think my mind is trained on looking at things more geographically, which is why I see point
elevations and slopes when I see a drawing of graph operators along a network). Here is an example map
of what I describe above:
[Figure: two sketch maps. Left: polygons labeled with elevations 620 m, 540 m, 670 m; the points
represent the geographic center of elevation for these polygons, assuming that elevation is uniform
across each polygon. Right: the same points now represent the vertices in G = (V, E), while the slope
lines (labeled 80, 50, 130) represent the edges; the edges assume a distance of 1 unit between points.]
6.6 Given a polygonal map of the political boundaries of the countries of the world, do the following:
Classify the following queries into classes of transitive closure, shortest path, or other.
a) Find the minimum number of countries to be traveled through in going from Greece to China:
shortest path.
b) Find all countries with a land-based route to the US: other.
6.7 Most of the discussion in this chapter focused on graph operations (e.g., transitive closure,
shortest path) on spatial networks. Are OGIS topological and metric operations relevant to spatial
networks? List a few point and OGIS topological and metric operations.
Intersect and Distance can be used to find shortest path.
Intersect, Touch, and Cross can be used to find transitive closure.
6.8 Propose a suitable graph representation of the River Network to precisely answer the following
query: Which rivers can be polluted if there is a spill in a specific tributary?
We should consider river segments instead of total rivers. The textbook’s initial graph representation of the
River Network models the rivers as nodes and the “falls into another” relationship as directed edges (pages
151 and 157). This graph, however, will not help us answer the above query; instead, we must change the
graph to make the confluences, heads, and mouths of rivers into nodes and the river segments into edges.
Thus, if a given river segment is polluted, we can trace downstream to identify segments that may get
polluted.
Transforming this graph into an appropriate graph yields a result similar to the BART graph (page 156).
The BART RouteStop table has columns ‘routenumber’, ‘stopid’, and ‘rank’. In the new River Network
graph, riverID is equivalent to ‘routenumber’ and confluenceID is equivalent to ‘stopid’. We can also rank
the confluenceIDs for a given riverID to make a rank column.
The River table stays the same for this new graph representation. Instead of making the entire
RiverConfluence table, we will draw the graph representing the nodes and directed edges (Figure 1).
River
riverID  Name
1        Mississippi
2        Ohio
3        Missouri
4        Red
5        Arkansas
6        Platte
7        Yellowstone
8        P1
9        P2
10       Y1
11       Y2
12       Colorado
13       Green
14       Gila
15       G1
16       G2
17       Gl1
18       Gl2

[Figure 1. New graph representation of the River Network: the confluences, heads, and mouths of rivers
are nodes (numbered 1-32) and the river segments are directed edges.]
RiverIntersection
RiverID  IntersectionIDs (in rank order)
1        6, 5, 4, 3, 2, 1
2        9, 3, 2, 1
3        12, 11, 10, 5, 4, 3, 2, 1
4        7, 2, 1
5        8, 3, 2, 1
6        13, 10, 5, 4, 3, 2, 1
7        14, 11, 10, 5, 4, 3, 2, 1
This table shows the river network through the Yellowstone River; a complete table for all river
segments would be very long.
6.9 Extend the set of pictograms discussed in Chapter2 for conceptual modeling of spatial networks.
The differences between the ER model and a spatial network:
- In a spatial network each node represents one object, but in the ER model each node (entity) represents
a set of instances (records).
- The relationships (connections) in a spatial network are directed; there is no direction between entities
in the ER model.
The nodes and connections can be represented using pictograms.

[Pictograms: bidirectional graph, unidirectional graph, node]

To represent a node, we use a hollow circle; the filled circle is already used to represent a point value in
the ER model.
6.10 Produce denormalize node table similar to Figure 6.2(e) for the River Network example.
riverid  fallsinto  merge of
1        -          (2, 3, 4, 5)
2        1          -
3        1          (6, 7)
4        1          -
5        1          -
6        3          (8, 9)
7        3          (10, 11)
8        6          -
9        6          -
10       7          -
11       7          -
12       -          (13, 14)
13       12         (15, 16)
14       12         (17, 18)
15       13         -
16       13         -
17       14         -
18       14         -
6.11 Extend the logical data model for spatial network by adding a sub-class “Path” for the class
“Graph”. List a few useful operations and application domains. List a few other interesting sub-class
of “Graph”.
If we extended the Graph class with a Path subclass, we would be able to store multiple paths over a
single network, not just the shortest path. In the bus example, we could keep one larger graph instead of
having to create and store multiple subgraphs that each signify a single route. For this, each entry in the
Path class would be independent of the others.
The above example makes some assumptions about how the graph is stored and what it represents.
Perhaps a better example is a road map: the Path class helps when the optimal shortest path has been
altered, possibly by construction work, since stored path entries let us quickly retrieve alternate routes.
If we use the model as in the first example, our graph shows nodes from other routes along with a
singular path. Then we might need an extension to show which nodes belong to which path and whether
the routes they belong to intersect; in general, the overlapping of paths.
6.12 Turn restrictions are often attached to road networks, for example, left-turn from a residential
road onto an intersecting highway is often not allowed. Propose a graph model for road networks to
model turn restrictions. Identify nodes and edges in the graph. How will one compute the shortest
path honoring turn restrictions?
The road network model includes the turn restrictions. There are five lines and ten routes crossing the
road network; all the lines converge at one point, such as downtown, and then spread out in different
directions. Use Dijkstra's algorithm to find the shortest path: it solves the single-source problem,
visiting an immediate neighbor of the source node and its successive neighbors recursively before
visiting other immediate neighbors.
6.13
(Group 8 did not answer this)
6.14 Compare and contrast Dijkstra's Algorithm with Best-First Algorithm for computing the
shortest path.
Dijkstra's algorithm finds the shortest paths from a specified source to all reachable nodes, while
best-first search finds a path leading from a specified source to a specified end point. Dijkstra's
algorithm always gives (one of) the best spanning tree(s), while best-first is heuristic: it usually gives
the shortest path from A to B, but it doesn't always. Dijkstra's algorithm links all nodes; best-first
doesn't always link all nodes, only those that it passes through while traversing from A to B.
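A sketch showing the relationship (hypothetical graph): with h = 0 the search below behaves as
single-pair Dijkstra; supplying a heuristic h turns it into a best-first (A*-style) search.

import heapq

def shortest_path(graph, src, dst, h=lambda n: 0):
    # Priority = cost so far + heuristic estimate toward the destination.
    pq, done = [(h(src), 0, src)], set()
    while pq:
        _, cost, u = heapq.heappop(pq)
        if u == dst:
            return cost
        if u in done:
            continue
        done.add(u)
        for v, w in graph.get(u, []):
            if v not in done:
                heapq.heappush(pq, (cost + w + h(v), cost + w, v))
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
print(shortest_path(graph, "A", "C"))   # 2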
6.15 What is a hierarchical algorithm for shortest path problem? When should it be used?
A hierarchical algorithm partitions a large graph into fragment graphs and constructs a boundary graph
for these fragments; these graphs are much smaller than the original graph. The process continues on the
boundary graphs until a graph of reasonable size (for example, one that fits in main memory) is obtained.
Proper construction of the boundary graph enables a shortest-path query on the original graph to be
decomposed into a set of shortest-path queries on fragments without compromising the optimality of the
path retrieved. The algorithm consists of three basic steps: (1) computing the relevant node pairs in the
boundary graph, by finding the boundary-graph nodes of the fragments containing the source and
destination nodes; (2) computing the boundary path, by finding the shortest path from the source node's
fragment to the destination node's fragment; and (3) expanding the boundary path within each fragment.
A hierarchical algorithm should be used when the graph is too large to fit in main memory; it reduces
the main-memory requirements and the I/O costs.
6.16 Compare and contrast adjacency-list and adjacency-matrix representation for graphs. Create
adjacency-list and adjacency-matrix representation for following graphs
(i)
Graph G in Fig 6.4
Adjacency-list representation:
1 -> 2, 5
2 -> 3
3 -> 4
4 -> NULL
5 -> 3
Adjacency-matrix representation:
     1  2  3  4  5
1    0  1  0  0  1
2    0  0  1  0  0
3    0  0  0  1  0
4    0  0  0  0  0
5    0  0  1  0  0
(ii)
Graph G* in Fig 6.4
Adjacency-list representation:
1 -> 2, 3, 4, 5
2 -> 3, 4
3 -> 4
4 -> NULL
5 -> 3, 4
Adjacency-matrix representation:
     1  2  3  4  5
1    0  1  1  1  1
2    0  0  1  1  0
3    0  0  0  1  0
4    0  0  0  0  0
5    0  0  1  1  0
(iii)
River network in Figure 6.3
Adjacency-list representation:
1 -> NULL
2 -> 1
3 -> 1
4 -> 1
5 -> 1
6 -> 3
7 -> 3
8 -> 6
9 -> 6
10 -> 7
11 -> 7
12 -> NULL
13 -> 12
14 -> 12
15 -> 13
16 -> 13
17 -> 14
18 -> 14
Adjacency-matrix representation:
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 1    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 2    1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 3    1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 4    1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 5    1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 6    0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 7    0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 8    0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
 9    0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
10    0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
11    0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
12    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
13    0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
14    0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
15    0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
16    0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
17    0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
18    0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
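A small sketch converting between the two representations (graph G above). The matrix costs O(n^2)
space but gives O(1) edge tests; the list costs O(n + e) space, which suits sparse graphs such as road and
river networks.

adj_list = {1: [2, 5], 2: [3], 3: [4], 4: [], 5: [3]}

n = len(adj_list)
matrix = [[0] * n for _ in range(n)]        # adjacency matrix, 0/1 entries
for u, vs in adj_list.items():
    for v in vs:
        matrix[u - 1][v - 1] = 1

# Back to an adjacency list: keep only the columns holding a 1 in each row.
back = {u + 1: [v + 1 for v, bit in enumerate(row) if bit]
        for u, row in enumerate(matrix)}
assert back == adj_list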
6.19 Revisit map-reclassify operation discussed in Section 2.1.5. Discuss if SQL3 Recursion and OGIS
operations may be used to implement it.
It may be possible to apply SQL3 recursion to spatial features that are recursive by definition.
Contain, inside, cover, and covered-by may use SQL recursion to find object relations, since these
operations can progressively deepen. In the example given in Section 2.1.5, a merge operation is used to
remove common boundaries; this is very similar to recursion, so SQL3 recursion may be used for this
type of operation.
6.20
breadth-first search: the BFS algorithm visits all nodes that are reachable from v.
  Benefit: if the tuples of the edge relation are physically clustered according to the value of their
  source nodes, the majority of the members of an adjacency list are likely to be found on the same disk
  page.
  Complexity: O(V + E)

Dijkstra's algorithm: a solution to the single-source shortest-path problem.
  Benefit: can be applied to a weighted, directed or undirected graph for the case where all edge weights
  are nonnegative.
  Complexity: O(n^2)

best-first search: uses a heuristic function to underestimate the cost between two nodes.
  Benefit: best-first search has been a framework for speeding up other algorithms by using semantic
  information about a domain.
  Complexity: depends on the heuristic function

Breadth-first search finds shortest paths in graphs whose edges have unit length. Thus, if Dijkstra's
algorithm is applied to a graph with edges of unit length, it reduces to breadth-first search. If we replace
the heuristic function of best-first search with the identity function, best-first search reduces to
breadth-first search.
6.21 Consider a spatial network representing road maps, with road intersections as nodes and road
segments between adjacent intersections as edges. Coordinates of the center points of road segments are
provided. In addition, stores and customers tables provide the names, addresses, and locations of retail
stores and customers. Explain how a shortest-path algorithm may be used for the following network
analysis problems:
1. For a given customer, find the nearest store using driving distance along road network.
This might require multiple Best-first single pair computation. For each of these queries the v node is the
consumer’s origin (e.g., work or home). The u node would be drawn from a list of stores that match the
user’s specification. The algorithm could find the distance between v and the first u in the list of u’s. This
would be set as the distance to beat. If a subsequent search finds a store that is closer along the network
than the “distance to beat,” then that store and distance are set as the “distance to beat.” In this way, the
search could continue through the list of u’s. Pruning could occur if the cumulative edge distance of a
given u is longer than the “distance to beat.”
2. For a given store S and a set of customers, find the cheapest path to visit all customers starting
from S and returning to S.
This task might be considered using a local or global distance optimization. Using a local (or greedy)
approach, I might use multiple Best-first single pair computations. The first search would find the closest
customer to S. Then the nearest customer is set as the source and the distance is measured to the remaining
customers. This would continue until all customers were visited.
A global optimization might test every combination of routes that connect the customers and then select
the shortest. This, however, would be very computationally expensive.
3. For each store determine the set of customers it should serve minimizing the total cost of serving
all customers by using nearest stores.
This is very similar to the first scenario; however, in this situation there are two lists of nodes. The source
node is still the customer’s location, but the search will be repeated for each customer in the list. For
instance, this might be abstracted as a nested for loop. The outer for loop is “for each customer find the
nearest store.” The inner for loop checks the network distance to each store. The closest store adds the
customer to its set. Then the loop continues through the customer list. In this way each customer will be
assigned to a given store. This is also very computationally expensive, however it would only have to be
performed periodically to update the stores’ sets.
4. Determine appropriate location for a new store to service customers farthest from current stores.
I can imagine a brute force approach, which finds a global measure of total (or mean) network distance
from customers to existing stores. Then place a new store node on the road network in the system and
recalculate the global measure. Iteratively move the new store along major roads until the global measure
of customer distance traveled is minimized. This approach is sort of like a k-means clustering algorithm,
but we are trying to minimize within cluster distance and maximize among cluster distance (stores would
represent cluster centroids).
This approach could be optimized by computing distance from consumer to existing stores then using this
information to find clusters or hot spots of customers that are furthest from stores. These hot spots would
be places where we might initially place the new store. Then as before, move the new store node to try to
minimize the global distance traveled.
Chapter 7
7.1
(a)
Set      Support
{a,b}    3 (3/8 = 37.5%)
{c}      6 (6/8 = 75%)
{a,b,c}  2 (2/8 = 25%)
(b)
Confidence for association rule {a,b}  {c}:
2/3 = 66.7%
(c)
Confidence for association rule {c}  {a,b}:
2/6 = 33.3%
(d)
Support for the association rule i1  i2 is based on the count (or proportion) of all transactions that
contain both i1 and i2. In contrast, confidence for the association rule i1  i2 is the proportion of
transactions containing i1 that also contain i2. Therefore, the support for i1  i2 and i2  i1 must be
the same, but the confidence for the two rules will differ (as they do in (b) and (c) above) if the
count of transactions containing i1 does not equal the count of transactions containing i2.
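A small sketch of the two measures (the transactions are hypothetical, chosen so the counts mirror parts
(a) through (c)):

transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"},
                {"c"}, {"c"}, {"c"}, {"c"}, {"d"}]

def support(itemset):
    # Count of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions)

def confidence(lhs, rhs):
    # Proportion of transactions with lhs that also contain rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"a", "b"}))             # 3
print(confidence({"a", "b"}, {"c"}))   # 0.666...
print(confidence({"c"}, {"a", "b"}))   # 0.333...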
(e)
Note: I have found it impossible to complete the exercise given only the information provided in
the table.
First, the data are mathematically invalid. According to the table, 45 of 100 lakes are near forest and 30 lakes are both near forest
and inside a state park. This means there must be 45 – 30 = 15 lakes that are near forest but not inside a state park. But the table
also indicates that 90 of the 100 lakes are inside a state park, meaning that there can only be 10 lakes that are not inside a state
park! So it is impossible that 15 lakes could be near forest but not inside a state park!
Second, even if the data were valid, it would not be possible to determine how many lakes are
both inside a state park and adjacent to federal land (as is necessary to complete the problem).
To complete the problem, I will make two assumptions:
1) There are 75 lakes (not 90) inside a state park.
2) There are 35 lakes that are both inside a state park and adjacent to federal land.
These assumptions mean the lakes can be summarized by the following diagram:
[Venn diagram summarizing the 100 lakes over near(X, forest), inside(X, state_park), and
adjacent(X, federal_land), with region counts 5, 20, 20, 5, 10, 10, 25, 5]
Therefore, the support and confidence for the association rules given in the table are:

Rule                                                              Support   Confidence
lake(X) -> near(X, forest)                                        45        45/100 = 45%
lake(X) and inside(X, state_park) -> near(X, forest)              30        30/75 = 40%
lake(X) and inside(X, state_park) -> adjacent(X, federal_land)    35        35/75 = 46.7%
lake(X) and inside(X, state_park) and adjacent(X, federal_land)
  -> near(X, forest)                                              10        10/35 = 28.6%

None of these rules has both a support of more than 30 and a confidence of more than 70 percent.
(f)
For a particular point p that is in neither C_mo nor C_mn, the distance between p and the nearest medoid
m_i, d(p, m_i), will be the same for both M_t and M_t+1, because the nearest medoid will be the same in
both cases (the only medoids that change between M_t and M_t+1 are m_o and m_n). Because there is no
change in d(p, m_i) for all such points p, only points that are in C_mo or C_mn must be fetched into
main memory, and only the nonmedoid points need to be fetched, because the locations of the medoids
m_o and m_n (which are needed to compute d(p, m_i) for all the nonmedoid points) should already be
available from prior processing in the algorithm.
(g)
The time required to fetch only the nonmedoid points of Cmo and Cmn relative to the time required
to fetch all points would be no more than 2/k (assuming that all clusters have the same size)
because fetching all points would require fetching the points for k clusters instead of just 2
clusters. However, Cmo and Cmn may contain many of the same points if mo and mn are near to
each other, in which case the optimal relative processing time could be as low as 1/k if the points
in Cmo are almost the same set as the points in Cmn.
7.4
a.)
Association rules are a kind of statistical correlation method. The major difference is that ARs
require a conditional probability test (X -> Y) between like objects (items in a store), whereas statistical
correlation is more interested in causal rules between different groups (big car = lousy miles per gallon).
b.)
It may be easier to see the distinction with respect to time. Autocorrelation implies that "things"
are always related under a given circumstance, as a universal law in any data set; in a sense, it is an
observation of itself that recurs periodically. Cross-correlation is the relation between two different sets
of data within the same time interval.
c.)
In this, classification is only part of a complete solution with spatial data mining. In location
prediction, we need an accurate classification model, but we also have to consider the relation of locality
and spatial context to the formulation.
d.)
When defining a hot spot, it is commonly considered that the points (or data items) are relatively
similar to a very high degree, which indicates their usefulness. A cluster is less extreme in this regard,
whereas while it may have a number of related elements, the points can still be diverse within the cluster.
e.)
The goal of classification is to assign items (or points) to a certain pre-defined class. Clustering
does not have the notion of a pre-defined class, and in fact, is used to help develop a classification scheme.
f.)
Associations demonstrate correlation between some item (or set of items) and another. By finding
associations, we can define a better classification structure. Here, there is a measure of certainty that these
items will always be correlated. In clustering, the goal is to minimize the difference between objects within
the cluster and maximize the difference between clustered objects from other clusters. However, it still can
be true that items may still be diverse (although slightly) within the cluster, and there isn’t a measure of
certainty that these items will always be clustered
7.7 What is special about spatial statistics relative to statistics?
One difference between spatial statistics and (regular, non-spatial) statistics is that spatial data has more
than one dimension. Another difference is that spatial data does not satisfy the independence assumption:
spatial data is usually highly self-correlated. This property, that like things cluster together in space, is
considered the first law of geography ("Everything is related to everything else, but nearby things are
more related than distant things").
7.8 Which of the following spatial features show positive spatial auto-correlation? Why? (Is there a
physical/scientific reason?)
Spatial autocorrelation refers to the relationship among values of a variable attributable to the way that the
corresponding area units are ordered in space. Positive spatial autocorrelation means that adjacent area
units have similar values or characteristics, negative spatial autocorrelation means that nearby area units
have dissimilar values or characteristics, and finally, no spatial autocorrelation means that there is no
particular systematic structure on how the pattern is formed. Water content, temperature, soil type, and
annual precipitation (rain, snow) have positive spatial auto correlation. Close by area tend to have similar
water content, temperature, soil type, and annual precipitation.
7.9 Classify the following spatial point functions into classes of positive spatial autocorrelation, no
spatial autocorrelation, and negative spatial autocorrelation.
a) positive spatial autocorrelation
b) negative spatial autocorrelation
c) positive spatial autocorrelation
d) no spatial autocorrelation
7.10 " The only data mining techniques one needs is linear regression, if features are selected
carefully "
Linear regression can be used for classifying linearly separable classes; if the features are selected
carefully enough to make the classes linearly separable, linear regression is sufficient.
7.11 Compute Moran’s I for the gray scale image shown in Figure 7.14(a), and matrices in Figure
7.6(b) and 7.6(c).
Figure 7.6(b): I used MATLAB to calculate the I values. In each case W is the contiguity matrix of the
grid (cells numbered row by row, each cell connected to its horizontal and vertical neighbors), with
every nonzero entry set to a normalized weight a.

z = [0 1 -4 -2 6 4 -5 -3 3];
a = 0.11;    % normalized value in the weight matrix
W = a * [0 1 0 1 0 0 0 0 0;
         1 0 1 0 1 0 0 0 0;
         0 1 0 0 0 1 0 0 0;
         1 0 0 0 1 0 1 0 0;
         0 1 0 1 0 1 0 1 0;
         0 0 1 0 1 0 0 0 1;
         0 0 0 1 0 0 0 1 0;
         0 0 0 0 1 0 1 0 1;
         0 0 0 0 0 1 0 1 0];
I = (z*W*z')/(z*z')     % I = 0.0152

Figure 7.6(c): the same W with a different z:

z = [-2 4 6 -4 0 3 -5 -3 1];
I = (z*W*z')/(z*z')     % I = 0.1460

Figure 7.14(a): the same construction on the 4x4 grid, so W is the 16x16 contiguity matrix of the 4x4
grid with a = 0.0625:

z = [-6.125 -4.125 -1.125 -0.125 -2.125 -8.125 3.875 3.875 ...
     3.875 -3.125 -11.125 4.875 -1.125 1.875 11.875 6.875];
a = 0.0625;
I = (z*W*z')/(z*z')     % I = 0.0064
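A cross-check in Python (a sketch; it builds the contiguity weights programmatically instead of typing
the matrix by hand):

import numpy as np

def morans_i(z_grid, a):
    rows, cols = z_grid.shape
    z = z_grid.flatten()                 # row-major numbering, as above
    W = np.zeros((z.size, z.size))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[r * cols + c, rr * cols + cc] = a
    return (z @ W @ z) / (z @ z)

z_b = np.array([[0, 1, -4], [-2, 6, 4], [-5, -3, 3]])
print(morans_i(z_b, 0.11))   # ~0.0152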
7.12 Compare and contrast the concepts in the following pairs:
i. Spatial Outliers vs. Global Outliers
A global outlier defines inconsistencies in relation to the whole of the data whereas spatial outliers are
concerned with a localized phenomenon. For example, consistently inconsistent readings from a sensor
compared to all other sensors is a case of a global outlier (and probably indicates sensor failure). A spatial
outlier is a strange “pocket” of inconsistent readings that were not discovered at any other location.
ii. SAR vs. MRF Bayesian Classifiers
The primary difference between the two is that SAR (Spatial Autoregressive Regression) treats object
classes as independent of each other. In MRF, class labels of neighbors are considered to be related to the
definition of the object. For both, predictions are localized to a strict neighborhood.
iii. Co-locations vs. spatial association rules
Spatial association rules need not consider locality to other objects when defining a relationship. In fact,
they do not consider individual items at all, and instead try to characterize behavior across the whole of
the data set. Co-location rules attempt to find common pairs (or groups) of items that repeatedly appear
together within some defined distance. Nothing needs to be known about the data set, other than that items
are distinct and identifiable, whereas in spatial association, some knowledge of what exists in the data must
be known to create effective rules.
iv. Classification accuracy vs. spatial accuracy
Both are intended to determine the effectiveness of classification. The first tests what percentage of all
objects was correctly classified by the algorithm. However, when space is discretized for classification, we
need to know how accurate the predictions are based on their actual locations. Hence, the need for a test of
spatial accuracy.
7.13 Spatial data can be mined either via custom methods (e.g., collocations, spatial outliers
detection, SAR) or via classical methods (e.g., association rules, global outliers, regression) after
selecting relevant spatial features. Compare and contrast these approaches. Where would you use
each approach?
A spatial outlier is a spatially referenced object whose values are significantly different from those of
other spatially referenced objects in its spatial neighborhood, i.e., a local instability relative to its
neighbors; a global outlier is inconsistent with the rest of the entire data set.
Spatial association rules are defined in terms of spatial predicates rather than items, and were designed
for categorical attributes. Co-location rules attempt to generalize association rules to point-collection
data sets that are indexed by space.
SAR is spatial autoregressive regression. SAR model is similar to the linear logistic model in terms of the
transformed feature space. In other words, the SAR model assumes the linear separability of classes in
transformed feature space.
7.14
(Group 8 did not answer this)
7.15 Identify the two clusters identified by the k-medoid algorithm for the following sets of points:
(i) (0,0), (1,1), (1,0), (0,1), (6,6), (7,6), (6,7), (7,7): clearly the medoids would be (1,1) and (6,6).
(ii) (0,0), (1,1), (1,0), (0,1): most likely the medoids would be (0,0) and (1,1).
7.16
(Group 9 did not answer this)
7.17 Define scale-dependent and scale-independent spatial patterns. Provide examples.
Scale-dependent spatial phenomena are those whose spatial patterns vary with the scale of observation.
For example, at the local scale (say, neighborhood level) the streets appear to form a regular grid, but at
the city level the street network may no longer appear as a regular grid. Most spatial patterns are
scale-dependent; in the geography community it is considered very important to situate one's research
at a specific scale and to realize that most geographic phenomena are scale-dependent.
On the contrary, scale-independent spatial phenomena do not show different spatial patterns at different
geographic scales. For example, no matter at which scale, fish live in the water, not on the land, so you
cannot go fishing in a land area that has no water.
7.21 Consider the problem of detecting multiple spatial outliers. Discuss how one should extend the
techniques discussed in Section 7.6. Note that a true spatial outlier can make its neighbor be flagged
as spatial outliers by some of the tests discussed in Section 7.6.
Given the techniques described in Section 7.6, the detection of multiple spatial outliers can become less
than precise because of the outliers’ influence on neighboring values. Thus, in order to effectively check
for multiple spatial outliers in a dataset, it is imperative to account for this neighborhood problem. One
more computationally intense method that could detect these outliers is to find one outlier per
calculation of the scatterplot, Moran scatterplot, or S(x). Once a value is detected and isolated, one can
remove that value from the dataset, run the calculations once again, and determine the next outlying
data point in the set. This can continue until no remaining data point falls beyond the chosen threshold
for determining whether it is an outlier.
Another method would be an extension of this calculation that might simplify based on the number of
iterations but may introduce more opportunity for error. From the first calculations, find all of the spatial
outliers that are not neighbors of another spatial outlier. If there are multiple outliers next to each other,
have the algorithm select the point that is most different within that population. Isolate these values, and
perform new calculations minus these values in the same manner as above. If there are still outliers, follow
the same isolation process and create new plots until there are no more determined outliers. Again, this
might introduce a bit more error into the calculations, but this could only be determined through testing of
the system.
Chapter 8
8.1
8/3   11/5  10/5  5/3
11/5  15/8  17/8  11/5
13/5  15/8  23/8  14/5
7/3   14/5  14/5  11/3
In order to compute Rhp = R - Rlp, we must first compute Rlp as shown in problem 8.1 (here the 3 x 3 focal mean, including the cell itself):

R:
1  2  3  3
2  4  1  3
2  2  2  5
1  1  4  4

Rlp:
2.25  2.16  2.67  2.50
2.16  2.11  2.77  2.83
2.00  2.11  2.88  3.16
1.50  2.00  3.00  3.75

Now we can compute Rhp = R - Rlp:

-1.25  -0.16   0.33   0.50
-0.16   1.89  -1.77   0.17
 0.00  -0.11  -0.88   1.84
-0.50  -1.00   1.00   0.25
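As a sanity check, the low-pass/high-pass decomposition above can be reproduced with a short script (a sketch, not from the book); it averages each cell's 3 x 3 window, counting only cells that actually fall inside the grid:

    import numpy as np
    from scipy.ndimage import convolve

    R = np.array([[1, 2, 3, 3],
                  [2, 4, 1, 3],
                  [2, 2, 2, 5],
                  [1, 1, 4, 4]], dtype=float)

    kernel = np.ones((3, 3))
    # Focal sums with zero padding, divided by the number of in-grid cells
    # in each window, give the focal mean used for the low-pass raster.
    sums = convolve(R, kernel, mode="constant", cval=0.0)
    counts = convolve(np.ones_like(R), kernel, mode="constant", cval=0.0)
    Rlp = sums / counts
    Rhp = R - Rlp
    print(np.round(Rlp, 2))
    print(np.round(Rhp, 2))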
8.3
Nitrogen Content:
1  2  3  3  4  1  3
2  2  5  1  1  4  4

Soil Map:
A  A  B  B  B  B  B
C  C  A  C  C  C  A
The raster maps shown above represent the nitrogen content and soil type at each pixel. Calculate the average nitrogen content of each soil type and classify the operation as either local, focal, zonal, or global.
The operation is zonal, since the value of a cell in the new raster is a function of the value of that cell in the original layer together with the values of the other cells that belong to the same zone in the soil-map raster. The average nitrogen content for each soil type is as follows:
For A: (1+2+5+4)/4 = 3
For B: (3+3+4+1+3)/5 = 2.8
For C: (2+2+1+1+4)/5 = 2
The raster map for this operation is as follows:
3    3    2.8  2.8  2.8  2.8  2.8
2    2    3    2    2    2    3
Average nitrogen content
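A minimal sketch of the zonal operation with numpy (the array layout follows the grids above; the variable names are my own):

    import numpy as np

    nitrogen = np.array([[1, 2, 3, 3, 4, 1, 3],
                         [2, 2, 5, 1, 1, 4, 4]], dtype=float)
    soil = np.array([["A", "A", "B", "B", "B", "B", "B"],
                     ["C", "C", "A", "C", "C", "C", "A"]])

    # Zonal operation: replace each cell with the mean of its zone.
    result = np.empty_like(nitrogen)
    for zone in np.unique(soil):
        mask = soil == zone
        result[mask] = nitrogen[mask].mean()
    print(result)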
8.4 The raster layer R shows the location of households (H) and banks (B). Assume each household does business with both banks. A survey of the last thirty visits by members of each household showed that a household interacted with the nearest bank 60 percent of the time. What was the average distance that members of each household traveled to banks in the last thirty days? Assume a distance of one to horizontal and vertical neighbors, and 1.4 to diagonal neighbors.
For this question, I assumed that each household visited a bank 30 times (i.e., once a day for 30 days). Further, as stated in the question, households preferentially visited the closer bank (60 percent of visits to the closest bank), and diagonal moves cost 1.4 while horizontal and vertical moves cost 1. To answer the question, I made a distance grid for each bank and then used these to calculate the average-distance grid. Example calculation: for the top-left cell of the average-distance grid, with the nearer bank at distance 2 and the farther at distance 3.4, the calculation is (18*2 + 12*3.4)/30 = 2.56.
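The layer R itself is not reproduced here, so the following sketch uses hypothetical bank and household cells purely to illustrate the computation; under the stated metric the shortest grid distance is 1.4*min(dx, dy) + (max(dx, dy) - min(dx, dy)).

    def grid_dist(a, b):
        """Distance with unit horizontal/vertical steps and 1.4 diagonals."""
        dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
        return 1.4 * min(dx, dy) + (max(dx, dy) - min(dx, dy))

    banks = [(0, 4), (4, 0)]                # hypothetical bank cells
    households = [(0, 0), (2, 2), (4, 4)]   # hypothetical household cells

    for h in households:
        d = sorted(grid_dist(h, b) for b in banks)
        # 60% of 30 visits (18) go to the nearest bank, 40% (12) to the other.
        avg = (18 * d[0] + 12 * d[1]) / 30
        print(h, round(avg, 2))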
8.7 (p.248) The vector versus raster dichotomy can be compared with the "what" and "where" in natural language. Discuss.
A raster is defined by a grid of pixels, each with its own color, which together make up an image. To display a raster graphic, we need to specify "what" value (color) each pixel should have. Once a raster image has been acquired, it is geo-referenced: the grid positions of the pixels are related to their corresponding latitudes and longitudes. In this way a computer can relate pixel position to latitude and longitude; however, the system has no knowledge of the features (such as a coastline) contained in the raster images it displays.
Vector graphics, on the other hand, are not defined by pixels and are not constrained to a grid format. A vector graphic consists of instructions describing how each object is shaped and its relative size. To display a vector graphic, the computer needs to know "where" each object should be placed.
8.8 Is topology implicit or explicit in raster DBMS?
Topology is implicit in a raster DBMS: the adjacency of cells is determined by their row and column positions in the grid rather than being stored explicitly, as it is in topological vector models.
8.9 Compute the storage space needed to store a raster representation of a 20km x 20km rectangular area at a resolution of:
(a) 10m x 10m pixels: (20,000/10)^2 = 4 x 10^6 pixels
(b) 1m x 1m pixels: 4 x 10^8 pixels
(c) 1cm x 1cm pixels: 4 x 10^12 pixels
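The question fixes only the pixel counts; assuming a hypothetical 1 byte per pixel (the question does not specify a pixel depth), a quick script gives the corresponding storage:

    # Pixel counts for a 20 km x 20 km area, with storage under the
    # assumption (not from the question) of 1 byte per pixel.
    side_m = 20_000
    for res_m, label in [(10, "10m x 10m"), (1, "1m x 1m"), (0.01, "1cm x 1cm")]:
        pixels = int((side_m / res_m) ** 2)
        print(f"{label}: {pixels:.1e} pixels, about {pixels / 1e9:.4g} GB at 1 byte/pixel")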
8.10 Study the raster data model proposed by OGIS employed by Idrisi Kilimanjaro, a raster-based
GIS software package developed by Clark Labs. Compare it with the map algebra model discussed
in this chapter.
As with most other GIS software packages, Idrisi Kilimanjaro uses the map algebra model for most of its raster-based analyses. Historically this has proven to be the easiest and most straightforward method of overlay among raster data sets. Idrisi is among the more powerful raster-based GIS software systems. It can perform simple overlays that combine multiple datasets into new data using simple mathematics, such as addition and normalized differencing. It can perform logarithmic and power transformations, and it distinguishes the four operation classes (local, focal, zonal, and global) fairly well. In performing neighborhood analyses, such as slope and aspect determination and even the more difficult interpolation techniques (inverse distance weighting, kriging, etc.), Idrisi can determine the extent of its neighborhood and hence which of the four operation classes an analysis belongs to.
One reason that OGIS may not yet have made a public recommendation for the raster data model is that the map algebra model is already so dominant within the GIS software community. OGIS may still be looking for another approach to combining and overlaying raster data, but so far nothing appears to be as effective as the map algebra approach. It will be interesting to see what OGIS finally recommends for the raster data model and how much, if at all, it differs from the already widely used map algebra model.
8.11 Define the “closure” property of algebra. Evaluate OGIS model for vector spatial data for
closure.
The closure property of algebra: if an operation is applied to two elements of a domain, the result must lie in the same domain.
The geometry-collection type gives the OGIS spatial data types closure under geometric operations such as 'geometric-difference' and 'geometric-intersection' (textbook p. 26), since any such operation on two geometries again yields a geometry.
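A small illustration using the shapely library (my own example, not from the book): the result of a geometric operation on two OGIS-style geometries is itself a geometry, so the domain is closed.

    from shapely.geometry import Polygon

    a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
    b = Polygon([(2, 2), (6, 2), (6, 6), (2, 6)])

    # All three results are again geometries, so the operations are closed.
    print(a.intersection(b).geom_type)  # 'Polygon'
    print(a.difference(b).geom_type)    # 'Polygon'
    print(a.union(b).geom_type)         # 'Polygon'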
8.12 What is special about data warehousing for geospatial applications (e.g. census)?
A. The data contain both spatial and nonspatial attributes. Results need to be presented as maps, in contrast to the tables or text used by traditional data warehousing methods, so visualization of data with spatial representations is very important for geospatial applications. Geospatial applications also need spatial aggregates, which are very difficult to implement.
8.13 Compare and contrast:
a.) Measures vs. Dimensions
Dimensions are independent variables that uniquely identify measures (dependent variables). A quick example: in a school database, your social security number or student id (dimension) can be used to identify the college of study you belong to (measure), and not the other way around.
b.) Data Mining vs. Data Warehousing
Data mining can be seen as the application of methods and algorithms to find patterns in the data, which in turn allows better prediction of future behavior, for instance by testing hypothesized patterns against historical data. In a sense, data mining is concerned with modeling the data, whereas data warehousing provides the actual data and helps inform the data mining algorithms about what would be a reasonable pattern to test for.
c.) Rollup vs. Drilldown
A rollup on a given dimension or set of dimensions aggregates the data into a smaller set of tuples. These tuples can be seen as an abstraction of the original data, obtained by reducing the level of detail considered. A drilldown operation does the exact opposite of a rollup.
d.) SQL2 Group By vs. Group By with CUBE
With the group by clause, you specify which dimensions you want to examine (along with defining them in the select clause); this amounts to a rollup on the dimensions not specified. The CUBE operator is a generalization of this procedure and provides a tabular, combinatorial view over the specified N dimensions. Thus, the regular group-by aggregation is done once, but the CUBE operation also shows all possible aggregations beyond the one specified.
e.) CUBE vs. Rollup
The differences between these have been alluded to above: the rollup operation provides a single sequence of aggregations, while CUBE is the generalized version that shows the aggregation for every combination of the specified dimensions (see the sketch below).
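A minimal sketch (my own, in Python rather than SQL) of what GROUP BY ... WITH CUBE computes: one aggregate for every subset of the listed dimensions, whereas a rollup would produce only the prefixes of one fixed dimension ordering.

    from itertools import combinations

    rows = [("MN", "2003", 10), ("MN", "2004", 20), ("WI", "2003", 5)]
    dims = ("state", "year")

    # CUBE: aggregate once per subset of dimensions (including the empty
    # subset, which yields the grand total).
    for r in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), r):
            totals = {}
            for row in rows:
                key = tuple(row[i] for i in subset)  # group on this subset
                totals[key] = totals.get(key, 0) + row[-1]
            print([dims[i] for i in subset], totals)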
8.14 Classify following aggregate spatial functions into distributive, algebraic, and holistic functions:
mean, medoid, autocorrelation (Moran’s I), minimum bounding orthogonal rectangle and minimum
bounding circle, minimum bounding convex polygon, spatial outlier detection tests in section 7.4.
Distributive: minimum bounding orthogonal rectangle and minimum bounding circle, minimum bounding
convex polygon,
Algebraic: mean, Moran’s I
Holistic: medoid, spatial outlier detection tests
8.15
(Group 8 did not answer this)
8.16 Some researchers in content based retrieval consider ARGs to be similar to the Entity
Relationship Models. Do you agree? Explain with a simple example.
No. ER modeling does not interact with queries in the same manner as an ARG. If I query a database of bank transactions modeled by an ER diagram, there is no parallel to the feature-space points that give ARGs their value: the ER model captures entities and relationships at the schema level, while an ARG attaches attribute values to the objects and relationships of each individual image.
8.17 Can ARGs model raster images with amorphous phenomena?
ARGs are completely connected graphs that represent objects and their relationships, which makes them naturally suitable for modeling discrete phenomena. Using this model to represent amorphous phenomena could be inaccurate: an ARG requires the specification of (discrete) objects within images, with relationships specified between those objects, and for amorphous phenomena this encapsulation can lead to errors. If we are only interested in the values at certain points, the model would work, but fully capturing the continuous nature of such phenomena may not be possible with ARGs.
8.18 Any arbitrary pair of spatial objects would always have some spatial relationship, e.g., distance.
This can lead to clique like ARGs with large number of edges to model all spatial relationships.
Suggest ways to sparsify ARGs for efficient content based retrievals.
We can think of several ways to sparsify ARGs for efficient content-based retrieval.
1) Method 1: Spatial objects usually have stronger connections with nearby objects and weaker connections with objects far away, so we can use a distance threshold as a filter: for objects at a distance greater than that threshold, we do not model the relationship in the ARG.
2) Method 2: Instead of a distance filter, we can use a specific number n as the filter, modeling only the relationships between each spatial object and its n nearest neighbors (see the sketch after this list).
3) Method 3: First use a clustering algorithm to divide all spatial objects into m clusters, then use a traditional ARG to model the spatial relationships among objects within each cluster. For objects belonging to different clusters, we do not model their relationships directly; instead we use the centroid of each cluster as its representative and build an ARG over these m centroids.
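A minimal sketch of Method 2 (the names and structure are my own): keeping only each object's n nearest neighbors bounds the edge count at n*|V| instead of the |V|^2 edges of a clique-like ARG.

    import math

    def sparsify_knn(objects, n=3):
        """objects: dict id -> (x, y). Returns directed edges from each
        object to its n nearest neighbors."""
        edges = set()
        for oid, (x, y) in objects.items():
            others = [(math.hypot(x - ox, y - oy), other)
                      for other, (ox, oy) in objects.items() if other != oid]
            for _, other in sorted(others)[:n]:
                edges.add((oid, other))
        return edges

    objs = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (10, 10), 5: (11, 10)}
    print(sorted(sparsify_knn(objs, n=2)))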
8.21 (p.249) Explore a spreadsheet software, for example, MS excel, to find pivot operation. Create
an example data-set and illustrate the effect of pivot operation.
Example data-set (Name, Date, Purchase, Amount):

Name    Date  Purchase  Amount
Kate     1    Beer        7
Mike     1    Beer        9
Mary     1    Diaper      5
Jenny    5    Formula    10
Mary     7    Diaper      5
Mary     7    Formula     5
Mary     9    Beer       10
Jenny   12    Bread       3
Kate    15    Beer        8
Mike    20    Beer       10
Mike    20    Bread       1

Pivoting on Name gives the sum of Amount purchased by each person:
Jenny 13, Kate 15, Mary 25, Mike 20; Grand Total 73.

Pivoting on Purchase gives the sum of Amount by item:
Beer 44 (Kate 15, Mary 10, Mike 19), Bread 4 (Jenny 3, Mike 1), Diaper 10 (Mary 10), Formula 15 (Jenny 10, Mary 5); Grand Total 73.

Pivoting on Date gives the sum of Amount by date:
Date 1: 21, Date 5: 10, Date 7: 10, Date 9: 10, Date 12: 3, Date 15: 8, Date 20: 11; Grand Total 73.
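The same pivot operation can be reproduced outside the spreadsheet; a sketch with pandas (my own, using the data-set above):

    import pandas as pd

    df = pd.DataFrame(
        [("Kate", 1, "Beer", 7), ("Mike", 1, "Beer", 9), ("Mary", 1, "Diaper", 5),
         ("Jenny", 5, "Formula", 10), ("Mary", 7, "Diaper", 5), ("Mary", 7, "Formula", 5),
         ("Mary", 9, "Beer", 10), ("Jenny", 12, "Bread", 3), ("Kate", 15, "Beer", 8),
         ("Mike", 20, "Beer", 10), ("Mike", 20, "Bread", 1)],
        columns=["Name", "Date", "Purchase", "Amount"])

    # Pivot: Names as rows, Purchases as columns, summed Amounts as cells;
    # margins=True adds the Grand Total row and column.
    print(pd.pivot_table(df, index="Name", columns="Purchase",
                         values="Amount", aggfunc="sum", fill_value=0,
                         margins=True))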
8.22 Study the census data-set and model it as a data warehouse. List the dimensions, measures, aggregation operations, and hierarchies.
Using the NHGIS dataset as an example; the examples listed here are limited to demographic information. Demographic data does not make much sense when grouped across years (for example, it makes no sense to sum the populations of different years), so we need to create data cubes for different census years separately. The following example refers to the data cubes for one specific census year.
The dimensions are as follows:
1) Location (which has a hierarchical structure: Region -> State -> County -> Tract -> BlockGroup -> Block)
2) Sex
3) AgeGroup
4) Race
5) Origin
6) MarriageStatus
Measure: Population
Since the location dimension has a hierarchical structure, the whole dataset must be structured as a lattice
of data cubes, where each cube is defined by the combination of a level of detail for each dimension. Data
abstraction in this model means choosing a meaningful summary of the data. Choosing a data abstraction
corresponds to choosing a particular projection in this lattice of data cubes: (a) which dimensions we
currently consider relevant and (b) the appropriate level of detail for each relevant dimensional hierarchy.
Specifying the level of detail identifies the cube in the lattice, while the relevant dimensions identify which
projection (from n dimensions down to the number of relevant dimensions) of that cube is needed.
Region_Sex_AgeGroup_Race_Origin_MarriageStatus (least detailed)
State_Sex_AgeGroup_Race_Origin_MarriageStatus
County_Sex_AgeGroup_Race_Origin_MarriageStatus
Tract_Sex_AgeGroup_Race_Origin_MarriageStatus
BlockGroup_Sex_AgeGroup_Race_Origin_MarriageStatus
Block_Sex_AgeGroup_Race_Origin_MarriageStatus (most detailed)
Figure 1: The lattice of data cubes (hierarchy defined by location)
With the dimensions numbered 1: Location, 2: Sex, 3: AgeGroup, 4: Race, 5: Origin, 6: MarriageStatus, the lattice runs from the single 6-D data cube 1_2_3_4_5_6 at the top, through the six 5-D cubes (1_2_3_4_5, 1_2_3_4_6, ..., 2_3_4_5_6), fifteen 4-D cubes, twenty 3-D cubes, fifteen 2-D cubes, and six 1-D cubes, down to the single 0-D cube (the grand total). The aggregate is Sum with ALL.
For example, 1_2_3_4_5 represents the cube query:
SELECT Location, Sex, AgeGroup, Race, Origin, SUM(Population) AS Population
FROM POPULATION
GROUP BY CUBE (Location, Sex, AgeGroup, Race, Origin)
Figure 2: The 0-D, 1-D, 2-D, 3-D, 4-D, 5-D, and 6-D data cubes
In this example, we can use aggregation operations such as Min and Max (e.g., the minimum population over all states), Sum (e.g., the total population of all states), and Average (the average population of all states). Basic cube operators such as roll-up, drill-down, slice and dice, and pivoting can also be used on top of the above aggregation hierarchy.
NEW QUESTIONS
Design and answer 3 new questions on selected sections in the book.
Question 1
Chapter 1: Design a query that could be either spatial or nonspatial and explain.
“Find the top salesperson in each region”
This query depends on how the word region is defined. The query can be nonspatial if the regions are static labels stored with each salesperson. However, if the regions are defined geometrically, the query becomes spatial: for example, the boundaries of the Minnesota regions of Minneapolis, St. Cloud, Rochester, and Duluth may change often, so determining each salesperson's region requires a spatial computation.
Question 2
Chapter 8.1.1
For the following neighborhood grid, find the:
 6   7  -3   9
 3   4  -6   2
 5   0   3   4
 0   1  -4  -6
a) FocalSum using the Rook scheme
b) FocalSum using the Bishop scheme
c) FocalSum using the Queen scheme
Taking each cell's focal sum over the cell itself plus its Rook/Bishop/Queen neighbors (clipped at the grid boundary), the results are:
a) Rook:
16  14   7   8
18   8   0   9
 8  13  -3   3
 6  -3  -6  -6
b) Bishop:
10   4   3   3
10  15  14   2
10  -7   4  -6
 0   9   0  -3
c) Queen:
20  11  13   2
25  19  20   9
13   6  -2  -7
 6   5  -2  -3
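A sketch verifying the three answer grids (my own; zero padding makes out-of-grid cells contribute nothing to each focal sum):

    import numpy as np
    from scipy.ndimage import convolve

    grid = np.array([[6, 7, -3, 9],
                     [3, 4, -6, 2],
                     [5, 0, 3, 4],
                     [0, 1, -4, -6]])

    kernels = {
        "rook":   np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]]),
        "bishop": np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]]),
        "queen":  np.ones((3, 3), dtype=int),
    }
    for name, k in kernels.items():
        print(name)
        print(convolve(grid, k, mode="constant", cval=0))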
Question 3
Chapter 2.1.4
Explain why the following nine-intersection matrix will not work for two dimensions:
1 1 0
1 1 0
0 0 0
This matrix says that the interiors and boundaries of A and B intersect each other, but that neither object meets the other's exterior and the exteriors do not meet. If the interiors and boundaries intersect only each other, there are two possibilities: the objects overlap at some points, or they overlap at every point (are totally identical). But the exteriors of any two bounded objects in the plane always intersect, so the bottom-right entry can never be 0, and this matrix is impossible. By contrast, the nine-intersection matrix for overlap contains a '1' in all nine positions.
Mid-term Exam. Solutions (Csci 8715, Spring 2004)
1. Currently, a literature survey via google.com, though useful, is not adequate. It does not cover a large fraction of the web, cannot search many databases, and has access neither to older documents nor to tacit (i.e., not yet codified) knowledge. In addition, it is non-trivial for users to identify the set of all keywords relevant to the topic of the search. Thus researchers should complement the use of google.com with some of the following:
(a) Search relevant databases, e.g., DBLP, citeseer, ACM DL, IEEE DL, amazon.com, etc.
(b) Visit libraries to search for older documents
(c) Contact domain experts for help in identifying relevant papers, keywords, people, and tacit knowledge
2. Use pictogram grammar in chapter 2 to design pictograms for
(a) linestring or polygon
(b) point
(c) add a new entity road-network, a new relationship part-of and add pictograms
(d) point
3. Parts
(a) A river originating in a country has two cases. The river may end within the same country without ever leaving its boundaries; this case is modeled using inside or within. Alternatively, the river may flow outside the country of origin; this is modeled by cross. Note that overlap is not appropriate unless the river is a polygon.
(b) The topological relationship between Vatican and Italy seemed to confuse many students. Note that the interior of Italy has a hole to account for the interior of the Vatican; thus the interiors of Italy and the Vatican do not intersect, and the relationship is modeled by touch given the OGIS topological predicates. Note that this example shows a limitation of the OGIS model, which does not capture the special spatial relationships among polygons with holes.
(c) Partitions require a collection of OGIS relationships: forest-stands are pairwise disjoint, each forest-stand is covered by its forest, and finally the union of all forest-stands should cover (or equal) the forest.
4. Most of the students did fine with this question. A few made the mistake of duplicating nodes/leaves across the siblings in an R-tree. Recall that the MOBRs of siblings may overlap in an R-tree; it is the R+-tree that allows replication of data items to eliminate the overlap between the MOBRs of siblings.
I will use nested parenthesis to denote the trees:
(a) R-tree: (root <1 (A (a b c)) (B (d e f)) > <2 (C (g h) ) (D (I j)) (E (k l)) > )
(b) Search path: root – 1 – A – (c) – B – d – e – 2 – C – g – h – D – I
5. Most of the students did fine, but a few tried to argue that HEPV is always better than FPV. Complete domination of one method by another is quite rare in the systems research area; usually each method dominates in some regime, and that is the case here. HEPV reduces update and storage costs, whereas FPV is faster at computing shortest paths.