CS8715 FINAL EXAM STUDY NOTES
ANSWERS TO THE TEXTBOOK QUESTIONS

Chapter 1

1.1 Discuss the differences between spatial and nonspatial data.
Spatial data is data that has an element related to space or location. Examples of spatial data include zip codes and shapes (such as molecule shapes). Nonspatial data is any data without a spatial attribute; it includes data such as names and most measured values (blood pressure, finish times in a marathon).

1.2 Geographic applications are a common source of spatial data. List at least four other important sources of spatial data.
- Satellite images
- Medical images
- Phone books (addresses)
- Environmental agencies
- Transportation agencies
- Cell phone companies
- The Internet

1.3 What are the advantages of storing spatial data in a DBMS as opposed to a file system?
The focus of data storage via file systems is to minimize the computational time of an algorithm, based on the assumption that all necessary data fit within an effectively unlimited supply of main memory. Unfortunately this is rarely the case: most data cannot be stored this way; rather, the data reside on a hard disk, and retrieval of this information takes much more time than a memory access. DBMS storage has therefore focused on optimizing I/O time. Storing spatial data in a DBMS results in faster data retrieval because a DBMS uses indexes for optimal data queries. While indexing historically resulted in a loss of spatial proximity, the development of spatial DBMSs is helping to alleviate this problem: the extension of the classic B-tree data structure to the R-tree allows the handling of multidimensional extended objects, of which spatial objects are a key component. Overall, while GIS developers continually improve computational algorithms, the greater gains in data access lie in improving how the system reads data from the hard disk, not from memory.

1.4 Database engines in general have underlying (a) data models and (b) query engines.
(a) The data model of an SDB can be implemented as an extension of the relational data model via the object-relational extensions of traditional relational databases. Object-relational databases (ORDBs) offer two new features over traditional relational databases: (1) new extended types and (2) new operations on these types. SDBs, at least in the OGIS model, can be viewed as ORDBs with (1) six new types (point, line string, polygon, multipoint, multiline string, and multipolygon) and (2) a dozen new operations, which are a subset of the operations defined by the 9-intersection model.
(b) ORDB query engines offer hooks for new indexes and query optimization hints. SDBs implemented on ORDB type extensions can use these features for spatial indexes and spatial query optimization.

1.5 List the differences and similarities between spatial, CAD, and image databases.
Similarities:
1. Use an object-relational database at the bottom layer.
2. Use a spatial application as the top layer.
3. Use abstract data types.
Differences:
1. Reference frames
2. Data models
3. Domain-specific rules for query optimization
Spatial, CAD, and image databases all use an object-relational database at the bottom layer, all use a spatial application as the top layer that interacts with the OR-DBMS through the spatial database, and all use abstract data types. They use different reference frames and data models.
They also use different domain-specific rules for query optimization.

1.6 The interplay between the vector and raster data models has often been compared with the wave-particle duality in physics. Discuss.
The object model can be likened to particles: in it, spatial entities are considered unique and separate from each other. The vector structure makes regions appear discrete, just as particles can be seen as parts of a whole. The field model is defined by functions that map a spatial area to a given domain. Thus the raster structure provides a way to demonstrate continuous behavior, as is seen in waves.

1.7 Cognitive maps are described as "internal representations of the world and its spatial properties stored in memory." Do humans represent spatial objects as discrete entities in their minds? Is there a built-in bias against raster and in favor of vector representation?
Humans generally think of spatial entities in whole form. For example, when people think of a lake, they think of a whole water surface instead of multiple "water points" that form the lake. However, in some special cases people may think of a spatial entity as consisting of many points. For example, astronomers may represent a galaxy as a polygon on a sky map, yet people may think of the galaxy as composed of many stars. All in all, there is a built-in bias against raster representation and in favor of vector representation in the human mind. In fact, either representation is just a model, and therefore a simplification, of the real world; neither can express everything about a real-world entity. Thus, in database applications and other applications that involve spatial entities, both are useful, and which representation is used depends on the specific application.

1.8 Why is it difficult to sort spatial data?
The essential problem is that spatial data has no natural sort order. As a consequence, many different sort orders for the same data set are available, and agreement on which one to use is difficult.

1.9 Predicting global climate change has been identified as a premier challenge for the scientific community. How can SDBMS contribute to the endeavor?
Calibrating today's complex climate change models requires an enormous amount of spatial and temporal data. As a result, global climate modelers benefit from SDBMS in a variety of ways. First, spatially enabled databases provide a persistent and secure tool to input, store, and retrieve spatial and temporal data. Because of the large volume of data being collected and maintained, it is economically efficient to have multiple clients (e.g., researchers at various institutions) access a small number of well-maintained data repositories, and SDBMS are designed for this type of multi-client access. Further, the majority of commercial and open-source SDBMS use a standard query language (i.e., SQL); this allows clients to access spatial database servers relatively independent of software or platform constraints. With the advent of SQL3 (and other parallel query languages), clients can perform complex distance and topological queries in addition to standard non-spatial queries. Lastly, through advances in spatial object indexing, these queries can be executed with great efficiency.
In addition to data management, access, and canned attribute and spatial query capabilities, SDBMS provide researchers with a body of data that can be exploited through data-mining techniques to reveal relationships between exogenous variables and climate change. Originating in the marketing and management fields, data-mining techniques offer enormous potential for modeling climate change and other environmental processes.

1.10 E-commerce retailers reach customers all over the world via the Internet from a single facility. Some of them claim that geography and location are irrelevant in the Internet age. Do you agree? Justify your answer.
We do not agree. Several factors remain tied to geography and location:
- Shipping. Suppose a Korean customer wants to order an item from an e-commerce retailer whose warehouse is located in the U.S. If the retailer's website does not support international shipping, both sides will have difficulty. Thus the retailer must consider its shipping boundary.
- Tax. If the retailer's warehouse is located in Minnesota and a customer living in Minnesota orders an item from the site, the item will be taxed, whereas no tax is imposed on a cross-state transaction. Thus the retailer must consider taxation according to the customer's location.
- Location-relevant items. If the retailer deals in real-estate properties such as houses, the items on its website must carry location information; a travel package is another example. As these examples show, the item itself sometimes has a geographic property.

1.11 Define location-based and location-independent services. Provide examples.
A location-based service is an information service provided by a device that knows where it is and modifies the information it provides accordingly. Any device that can measure its location (e.g., with GPS), or whose location can be inferred, and that can use that knowledge to select, transform, or modify information, can provide a location-based service. For example, the proposed E-911 emergency system requires cell phone operators to provide the government with the location of a 911 caller; in this case the cell phone needs to be equipped with a device (such as GPS) that can report the caller's location.
A location-independent service is an information service that does not need to know, and does not need to provide information about, location. For example, a website (server) that provides music for users to download, paid for by credit card, behaves the same wherever the user happens to be.

Chapter 2

2.2 Weather forecasting is an example where the variables of interest (pressure, temperature, and wind) are modeled as fields, but the public prefers to receive information in terms of discrete entities, for example "The front will stall" or "This high will weaken." Can you cite another example where this field-object dichotomy is apparent?
Inside a car, the engine and other components are monitored continuously, and if a temperature moves out of a given range, the 'Check Engine' light inside the car comes on.
The driver does not need to know the status of each component at all times, just whether the car is running properly: the continuously varying component temperatures are the field, while the discrete warning light is the object.

2.3 A lake is sometimes modeled as an object. Can you give an example in which it might be useful to model a lake as a field?
If we want to model the depth in different parts of the lake, it is more appropriate to model the lake as a field, because there are no clear internal boundaries.
Are lake boundaries well defined?
The boundaries of a lake are often not well defined. For instance, it is unclear whether the extension into a river or a creek should be considered part of the lake, and during flood season the lake's boundary effectively disappears as the inflow/volume proportion increases.

2.4 Match the columns:
Nominal - Social security #
Ordinal - Color spectrum
Interval - Temp. in Celsius
Ratio - Temp. in Kelvin

2.5 Design an ER diagram to represent the geographic and political features of the World. The World consists of three entities: country, city, and river. On the basis of these three entities answer the following questions:
We assumed that every country has exactly one business center. We further assume that every river is owned by at least one country, but there may be countries that do not own any rivers. We also assumed that the database contains cities that are not business capitals. Moreover, in our model we allow countries not to have diplomatic ties with any other country. Note: the thick lines on the diagram indicate full participation; the thin lines indicate partial participation.

2.6 Consider the problem of representing the OGIS hierarchy in an OO language such as Java. How would you model inheritance? For example, MultiPoint inherits properties from both the Point and GeometryCollection classes. How will you model associations and cardinality constraints? Where will you use abstract classes?
Abstract classes can be used for the common functional units, and associations can be modeled with abstract classes or interfaces, since they connect two classes and specify only behavior. Inheritance is modeled by defining a class as a subclass of a previously defined class (or as a nested class). Because Java does not support multiple class inheritance, a class such as MultiPoint, which inherits from both Point and GeometryCollection, must pick up one of its parents through an interface. Cardinality constraints can be enforced inside the collection classes. A minimal sketch follows.
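The following Java sketch illustrates one way to set this up. The class and method names are illustrative rather than taken from the OGIS specification, and a single distance() method stands in for the full operation set:

import java.util.ArrayList;
import java.util.List;

// Common root of the hierarchy: an abstract class for shared behavior.
abstract class Geometry {
    abstract double distance(Geometry other);
}

class Point extends Geometry {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
    double distance(Geometry other) {
        if (other instanceof Point) {
            Point p = (Point) other;
            return Math.hypot(x - p.x, y - p.y);
        }
        throw new UnsupportedOperationException("only point-point shown");
    }
}

// The "second parent" is expressed as an interface, since Java forbids
// multiple class inheritance.
interface GeometryCollection<T extends Geometry> {
    List<T> elements();
}

class MultiPoint extends Geometry implements GeometryCollection<Point> {
    private final List<Point> points = new ArrayList<>();   // cardinality 0..n
    void add(Point p) { points.add(p); }
    public List<Point> elements() { return points; }
    double distance(Geometry other) {
        double best = Double.POSITIVE_INFINITY;              // minimum over members
        for (Point p : points) best = Math.min(best, p.distance(other));
        return best;
    }
}

A 1..n cardinality constraint could be enforced by rejecting construction of an empty MultiPoint, which is one way the pictogram cardinalities translate into code.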
2.7 Which UML diagrams are relevant for data modeling? Do they offer any advantage over ER? How will you represent the following concepts in UML: primary key, foreign key, entities, relationships, cardinality constraints, participation constraints, and pictograms?
UML helps provide information relevant to both the field and object data models. For starters, the field model is reflected in the class diagrams and methods, because specific objects are seen as belonging to a group (with corresponding functionality). Underneath this, objects are assigned unique identification by the system, but this is not shown in the UML diagram. Instead, the object model is roughly represented by class attributes, although there is no guarantee that the attributes uniquely identify objects (e.g., two objects may have the same name). While UML does not provide a way to quickly tell what makes an object unique (as the ER diagram does by showing primary-key attributes), this is unneeded, since the objects that UML groups together have already been identified and the schema designer does not really need to know how. UML then provides a distinct advantage over ER by showing how classes are related (e.g., through inheritance) and how a class is defined, which together make it easy to transfer a schema to object definitions.
How are these concepts represented in UML?
Primary key: not modeled (unnecessary, considering all objects are uniquely identified by the same system attribute, the oid).
Foreign key: since no primary key is modeled, there is nothing to reference in another class; however, UML does model associations between entities.
Entities: these become classes, and additional information (such as methods) is included.
Relationships: remain the same as in the ER model, except that M:N relationships do not necessarily require a new class (unless the relationship adds attributes).
Cardinality: done the same way as in ER.
Participation: done with aggregation (to express the part-of-a-whole nature of some classes). The total participation needed for weak entities is not modeled.
Pictograms: added to class diagrams just as they were added to entities.

2.8 Model the state-park example using UML.

2.9 Classify the following into local, focal, and zonal operations:
a. slope: focal
b. snow-covered park: local (consider a point: is it snow-covered?)
c. site selection: focal (is the neighborhood around point X suitable?)
d. average population: zonal
e. total population: zonal

2.10 Many spatial data types in OGIS refer to a planar world. These include line strings and polygons. These may include large approximation errors while modeling the shape of long rivers or large countries on the spherical surface of the earth. Propose a few spherical spatial data types to better approximate large objects on a sphere. Which spatial relationships (e.g., topological, metric) are affected by planar approximations?
The metrics of large geographic features become distorted when projected onto a planar surface. In an attempt to minimize this distortion, cartographers have developed a variety of projection systems (e.g., UTM, State Plane Meter, Albers). However, no projection scheme completely removes distortion for very large spatial objects. Rather, we might consider expanding the set of basic shapes (and their collections) to include additional higher-dimensional shapes. One such spatial data type might represent volumes (and volume collections). A spherical world is three-dimensional, which has implications for non-topological spatial relationships (i.e., metric attributes and relationships are distorted on a sphere). Geographers and cartographers are very familiar with the difficulties of projecting three-dimensional data into a two-dimensional model; the changes include direction, distance, and area. By modeling space in three dimensions, we can avoid the compromises that projections demand: we can retain topology and also have correct direction, distance, and area. Volumes would be composed of one or more polygons. While existing topological relationships would remain valid, new relationships (and query predicates) would necessarily come into existence to describe the various relationships between volume (and volume-collection) objects and the pre-existing simple objects.

2.11 Revisit the Java program implementing the query about tourist offices within ten miles of campground "Maple." The main() in class FacilityDemo will slow down linearly as the number of tourist offices increases. Devise a variation of the plane-sweep algorithm from Chapter 1 to speed up the search.
The fTable array is sorted on the X-coordinate of the locations of the facilities.
The method main (which finds the facilities within 10 miles of "Maple") looks only at entries whose X-coordinates are less than or equal to (X-coordinate of "Maple") + 10, checking each time whether the entry lies within a 10-mile radius. This lets the method stop scanning the array as soon as the X-coordinate becomes greater than (X-coordinate of "Maple") + 10.
Assumption: coordinates of Point and distances are expressed in miles.

public class FacilityDemo {
    public static void main(String[] args) {
        Facility f = new Facility("Maple", "Campground", new Point(2.0, 4.0));
        Facility[] fTable = new FacilitySet("facilityFile");  // load the facilities
        String[] resultTable = new String[fTable.length];
        int i = 0, j = 0;
        sort(fTable);                 // sort fTable on the X-coordinate of the location
        double xu = 2.0 + 10.0;       // X-coordinate of "Maple" + 10
        while (i < fTable.length && fTable[i].point.x <= xu) {
            if (f.withinDistance(fTable[i], 10.0) && fTable[i].type.equals("Tourist-Office"))
                resultTable[j++] = fTable[i].name;
            i++;
        } // end while
    }
}

We can cut down the scan further if the facility "Maple" is in the array. Since the array is sorted, we can do a binary search for the first entry with X-coordinate 2.0, find the entry for "Maple", and scan in both directions until the X-coordinate of the facility is less than (X-coordinate of "Maple") - 10 or greater than (X-coordinate of "Maple") + 10. This restricts the scan to the range (X-coordinate of "Maple") - 10 < x < (X-coordinate of "Maple") + 10.

public class FacilityDemo {
    public static void main(String[] args) {
        Facility f = new Facility("Maple", "Campground", new Point(2.0, 4.0));
        Facility[] fTable = new FacilitySet("facilityFile");  // load the facilities
        String[] resultTable = new String[fTable.length];
        int j = 0;
        sort(fTable);                       // sort fTable on the X-coordinate of the location
        int x = binarySearch(fTable, 2.0);  // index of some entry with X-coordinate 2.0
        if (x == -1) return;                // assuming the search returns -1 on failure;
                                            // this version works only if "Maple" is in the array
        // scan forward to the entry for "Maple" itself
        while (x < fTable.length && fTable[x].point.x == 2.0 && !fTable[x].name.equals("Maple"))
            x++;
        double xu = 2.0 + 10.0;             // X-coordinate of "Maple" + 10
        double xl = 2.0 - 10.0;             // X-coordinate of "Maple" - 10
        int i = x;
        while (i < fTable.length && fTable[i].point.x <= xu) {   // sweep right
            if (f.withinDistance(fTable[i], 10.0) && fTable[i].type.equals("Tourist-Office"))
                resultTable[j++] = fTable[i].name;
            i++;
        } // end while
        i = x - 1;                          // start left of "Maple" to avoid counting it twice
        while (i >= 0 && fTable[i].point.x >= xl) {              // sweep left
            if (f.withinDistance(fTable[i], 10.0) && fTable[i].type.equals("Tourist-Office"))
                resultTable[j++] = fTable[i].name;
            i--;
        } // end while
    } // end main
} // end FacilityDemo
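The binarySearch helper above is assumed to return the index of some entry with the given X-coordinate (or -1 on failure), after which a forward scan locates "Maple" itself. A lower-bound variant that always lands on the first entry with X-coordinate >= key would remove the failure case; a minimal sketch, using the Facility fields assumed above:

static int lowerBound(Facility[] fTable, double key) {
    int lo = 0, hi = fTable.length;       // search window [lo, hi)
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;        // overflow-safe midpoint
        if (fTable[mid].point.x < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;                            // may equal fTable.length if key is past the end
}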
2.12 Develop translation rules to convert pictograms in a conceptual data model (e.g., an ERD) to OGIS spatial data types embedded in a logical data model (e.g., Java, SQL) or a physical data model.
Translation rules to convert pictograms in a conceptual data model to OGIS spatial data types in a logical data model:
1) Syntax-directed translation: non-terminal elements need to be translated into terminal elements. For example:
<pictogram> -> <shape> | <any possible shape *> | <user-defined shape !>   (| denotes OR)
<shape> -> <basic shape> | <multi-shape> | <derived shape> | <alternate shape>
<basic shape> -> point | line | polygon
<multi-shape> -> <basic shape> <cardinality>
<derived shape> -> <basic shape>
<alternate shape> -> <basic shape> <derived shape> | <basic shape> <basic shape>
Cardinality quantifies the number of objects, so it indicates the presence or absence of the objects. The cardinality therefore results in integrity constraints. For example:
<cardinality> -> 0,1 | 0,n | 1 | n | 1,n
(0,1) or (0,n) denotes that we may have zero or many objects, so NULL is allowed; (1), (n), or (1,n) denotes that we have one or many objects, so the attribute is NOT NULL.
2) Translating entity pictograms to data types: entity pictograms placed inside the entity boxes are translated into appropriate data types. These data types are Point, Line, and Polygon. The translator using the syntax-directed translation does this automatically. For example:
<pictogram data type> -> PointType | LineType | PolygonType
<raster partition> -> <raster> | <Thiessen> | <TIN>
3) Translating relationship pictograms: relationships among geographic entities are actually conditions on the objects' positions and are called spatial relationships. Relationship pictograms are translated into spatial integrity constraints of the database. For example:
<relationship> -> part_of (partition) | part_of (hierarchical)
4) Translation of entities and relationships to tables:
(1) Map each entity onto a separate relation. The attributes of the entity are mapped onto the attributes of the relation, and the key of the entity becomes the primary key of the relation.
(2) For a relationship whose cardinality is 1:1, the key attribute of either one of the entities is placed as a foreign key in the other relation.
(3) If the cardinality of the relationship is M:1, place the primary key of the 1-side relation as a foreign key in the relation of the M-side.
(4) If the cardinality of the relationship is M:N, map each M:N relationship onto a new relation. The name of the relation is the name of the relationship, and the primary key of the relation consists of the pair of primary keys of the participating entities. If the relationship has any attributes, they become attributes of the new relation.
(5) For a multi-valued attribute, create a new relation with two columns: one for the multi-valued attribute and the other for the key of the entity that owns it. Together, the multi-valued attribute and the key of the entity constitute the primary key of the new relation.

2.15
OGIS data type    Pictogram
Point             (point pictogram)
Line string       (line pictogram)
Polygon           (polygon pictogram)
Multipoint        (point pictogram with cardinality n)
Multiline string  (line pictogram with cardinality n)
Multipolygon      (polygon pictogram with cardinality n)

2.17 Countries on a map of the world are represented as surfaces and modeled by polygons. Rivers are represented as curves and modeled by line strings. Lakes are represented as surfaces and modeled by polygons. Highways are represented as curves and modeled by multiline strings. Cities are represented as points. As the scale of the map grows larger, the spatial data types change from lower-dimensional to higher-dimensional objects: from point to line string, and from line string to polygon. In order to represent scale dependence in pictograms, one could mark the map scale on each entity in the pictogram.

2.18
1. A single manager can only manage a single forest.
2. A manager must manage all the forest stands in a forest (given that no manager can co-manage a forest).
3. Many (or one) fire stations can monitor a single forest (better safe than sorry).
4. A forest can have many facilities within its borders ("belonging"), and every facility can only be associated with one forest.
// NOTE: For 5 and 6, no specific spatial relationship can be defined, but by knowing the object types involved, we can determine likely relationships.
5. River is connected to Forest by Road, so the spatial relationship is line-line-polygon. We know that a river crosses a road and the road crosses the forest, but it need not be true that the river crosses the forest. If the two have any relationship at all, it might be "crosses" or "borders".
6. The relationships are of the form polygon-polygon. In this case we also know that many forest stands are within the forest, so the relationships are "contains" or "overlaps" (for when the boundary of a forest cuts through a forest-stand object).

2.20 Study the relational schema in Figure 2.5 and identify any missing tables for M:N relationships. Also identify any missing foreign keys for 1:1 and 1:N relationships in Figure 2.4. Should the spatial tables in Figure 2.6 include additional tables for collections of points, line strings, or polygons?
M:N relationships in Figure 2.4:

Relationship                            Included in the schema?
Facility - supplies_water_to - River    Yes [Supplies_Water_To]
River - crosses - Road                  No
Road - accesses - Forest                Yes [Road-Access-Forest]

The River - crosses - Road relationship can be computed from the geometries, hence there is no need to explicitly include a table for it.
1:1 relationships in Figure 2.4:

Relationship [A - rel - B]      'A' as foreign key in 'B'    'B' as foreign key in 'A'
Manager - manages - Forest      Manager has ForName          Forest has no manager attribute
Facility - belongs_to - Forest  Facility has ForestName      Forest has no facility name

Strictly speaking, in a 1:1 relationship between entities A and B it is enough to include the key of one entity in the table of the other, i.e., it is enough to include the forest name in the Manager table. However, if the manager of a forest is frequently queried, then instead of joining the Forest and Manager tables for every query it may be advantageous to record the manager in the Forest table as well. This brings in redundancy, and consequently potential delete and update inconsistencies, but such a decision can increase performance. [Physical schema tuning]
1:M relationships in Figure 2.4:

Relationship [A - rel - B]       '1-side' key in 'M-side'       'M-side' key in '1-side'
Facility - within - Forest       Facility has ForestName2       Not possible
ForestStand - part_of - Forest   Forest-name in Forest-Stand    Not possible
FireStation - monitors - Forest  ForName in Fire-Station        Not possible

Since these relationships are 1:M, it is not possible to apply the physical schema design ideas described above. If we tried to insert all facilities that a certain forest contains into the Forest table, we would even violate the first normal form. On the other hand, if we stored the set of facilities as a multipoint, that is, as a single object, we could create a Facility attribute in the Forest table to avoid repetitive computation of the join. In that case we in effect replace a 1:M relationship with a 1:1 relationship by representing a set of points as a single object (multipoint). The multi-object types are not necessary, but they can be beneficial, as the above example shows.

2.21
1. An alternate shape can be used to model roads, because a road can be a line string or a polygon.
2. An alternate shape can also be used to model roads where the representation can be a collection of points or a collection of polygons.
3. One entity, RoadNetwork, can be added. It has the type "collection of line strings and collection of polygons" and is connected to the Roads entity by a part_of relationship.
4. The Manager entity can be modeled with a point shape.

2.22 I looked at the TIGER files and tried to glean as much information as possible from the documentation and the data dictionary that describes all of the record types included in TIGER/Line files. It appears that the schema would resemble a star schema like that of a data warehouse. It seems to be made up of "county" data, which encapsulates addresses, zip codes, geographical features, landmarks, etc. The diagram would have the county fact table in the middle, with all of the other supporting tables linked into that one central data repository.

2.23
Answer 1: The ER diagram as presented in Figures 2.4 and 2.7 shows that several fire stations can exist, but each monitors only one forest (a many-one relationship). To allow a fire-station instance to monitor more than one forest, the relationship would need to be changed from many-one to many-many. In the relational schema, the forest-name key would be dropped and replaced with an attribute that notes the forests a given station services, allowing a single station to respond to multiple forests. However, it might be preferable to drop the forest-name attribute altogether and add an attribute that specifies the maximum distance within which a fire station will respond to a fire. This approach is preferable because it explicitly considers the spatial relationship between the station's location and the location of the fire: once a fire is sighted, a simple nearest-neighbor query could identify the stations that should respond.
Answer 2: In Figure 2.4, change the Fire-Station - monitors - Forest relationship from M:1 to M:N. In the relational schema of Figure 2.5, create a relationship relation Monitors(FStaName varchar, ForName varchar) between the Fire-Station and Forest relations, and remove the foreign key ForName from Fire-Station. The UML diagram of Figure 2.7 changes accordingly.

2.24 Left and right sides of an object are inverted in mirror images; however, top and bottom are not. Explain using absolute and relative diagrams.
This question deals with the directional relationship between an object, a viewer, and a common coordinate system. The viewer sees the left/right switch when looking directly at the object versus looking at the object through the mirror. This switch is seen because the object's (relative) left and right actually do switch when the object is rotated into the mirror. The top and bottom of the object stay constant because they are common to both the viewer and the object. In some ways the object represents a second viewer looking back at the observer: both maintain their own relative left and right, and share a common up and down.

Chapter 3

3.1 Express the following queries in relational algebra.
a) Find all countries whose GDP is greater than $500 billion but less than $1 trillion.
   π_Name(σ_(GDP > 500 AND GDP < 1000)(Country))
b) List the life expectancy in countries that have rivers originating in them.
   1. R <- π_(Name, Life-Exp)(Country)
   2. S <- π_Origin(River)
   3. R ⋈_(Name = Origin) S
c) Find all the cities that are either in South America or whose population is less than two million.
   1. R <- π_Name(σ_(Cont = 'SAM')(Country))
   2. S <- π_Name(σ_(Pop < 2)(City))
   3. R ∪ S
d) List all the cities which are not in South America.
   1. R <- π_Name(City)
   2. S <- π_Name(σ_(Cont = 'SAM')(Country))
   3. R − S
3.2
(a) SELECT C.Name
    FROM Country C
    WHERE C.GDP > 500.0 AND C.GDP < 1000.0
(b) SELECT C.Name, C.Life-Exp
    FROM Country C, River R
    WHERE C.Name = R.Origin
(c) SELECT Ci.Name
    FROM City Ci, Country Co
    WHERE Co.Name = Ci.Country
      AND (Ci.Pop < 2.0 OR Co.Cont = 'SAM')
(d) SELECT Ci.Name
    FROM City Ci, Country Co
    WHERE Co.Cont <> 'SAM' AND Co.Name = Ci.Country

3.3 Express in SQL the queries listed below.
A. Count the number of countries whose population is less than 100 million.
   SELECT COUNT(Co.Pop)
   FROM Country Co
   WHERE Co.Pop < 100;
B. Find the country in North America with the smallest GDP (use a nested query).
   SELECT Co.Name
   FROM Country Co
   WHERE Co.GDP <= ALL (SELECT Cou.GDP
                        FROM Country Cou
                        WHERE Cou.Cont = 'NAM');
C. List all countries that are in North America or whose capital cities have a population of less than 5 million.
   SELECT DISTINCT C.Name
   FROM Country C LEFT JOIN City Ci ON (Ci.Country = C.Name)
   WHERE C.Cont = 'NAM' OR (Ci.Capital = 'Y' AND Ci.Pop < 5);
D. Find the country with the second highest GDP.
   SELECT Name
   FROM Country
   WHERE GDP >= ALL (SELECT GDP
                     FROM Country
                     WHERE GDP < (SELECT MAX(GDP) FROM Country))
     AND GDP < (SELECT MAX(GDP) FROM Country);

3.4 (p. 76) Reclassify is an aggregate function that combines spatial geometries on the basis of non-spatial attributes. It creates new objects from existing ones, generally by removing the internal boundaries of adjacent polygons whose chosen attribute is the same. Can we express the Reclassify operation using OGIS operations and SQL92 with spatial data types? Explain.
We can express the Reclassify operation using the OGIS Union and Touch operations: Touch tests whether polygons are adjacent, and Union removes internal boundaries, based on a test of the non-spatial attributes. For example, to reclassify the map of countries on the basis of the majority religion practiced in each country, the Touch operation tests whether two countries are adjacent; if so, we use SQL92 to test whether the two countries have the same majority religion, and if they do, the boundary between the neighboring countries is removed with the Union operation.

3.5 Discuss the geometry data model of Figure 2.2. Given that on a "world" scale cities are represented as point data types, what data type should be used to represent the countries of the world? Note: Singapore, the Vatican, and Monaco are countries. What are the implementation implications for the spatial functions recommended by the OGIS standard?
The OGIS geometry model for representing spatial entities in spatial databases provides a general Geometry shape defined on a particular spatial reference system. This general Geometry breaks down into four basic shapes: zero-dimensional points, one-dimensional curves, two-dimensional surfaces, and geometry collections. The last is essentially a collection of the other three basic shapes that provides closure. Based on Figure 2.2, points make up line strings (which approximate curves), and line strings make up surfaces. On a world scale, points will represent cities and surfaces will represent countries, no matter how small the countries are. In a spatial database we should store countries, even small ones like the Vatican City, San Marino, and Monaco, as surfaces, so that we can nest cities (e.g., Monte Carlo in Monaco) within the surfaces. This nesting allows us to execute point-in-polygon queries on the database.
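As an aside, such a point-in-polygon test is commonly implemented by ray casting. A minimal Java sketch, assuming a simple (non-self-intersecting) polygon given as parallel coordinate arrays; this illustrates the idea, not the OGIS Within() implementation itself:

class PointInPolygon {
    // Cast a ray from (px, py) to the right and count edge crossings:
    // an odd count means the point is inside.
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            if ((ys[i] > py) != (ys[j] > py)) {   // edge straddles the ray's y
                double xCross = xs[j] + (py - ys[j]) * (xs[i] - xs[j]) / (ys[i] - ys[j]);
                if (px < xCross) inside = !inside;
            }
        }
        return inside;
    }
}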
Also, we should store all countries as the same data type: storing some countries as points and some as polygons makes it difficult to perform spatial analysis, and maintaining entities like countries in one data type eases the implementation of the OGIS spatial functions. Now, if we want to make a world map of countries from this spatial database, the small size of some countries will make it difficult to render them on a regular sheet of paper, so it may make sense to convert small countries to points for the map. Ultimately, within the spatial database we should maintain countries as surfaces for analytical and topological processing; moving from the spatial database to a map, however, may require transformations among data types.

3.6 [Egenhofer, 1994] proposes a list of requirements for extending SQL for spatial applications (see the table on page 77). Which of these recommendations have been accepted in the OGIS SQL standard? Discuss possible reasons for postponing the others.
Spatial abstract data types: implemented.
Graphical presentation: the results returned by the SQL engine are tables, not graphical images. Consider a query like "Summarize the distances of the capital cities from the Equator": SQL is the interface to the database, not the interface to the user.
Result combination: the queries are closed, hence their results can be combined. Implemented.
Context: SQL is a mathematically sound language; it cannot just haphazardly include extra information.
Content examination: SQL is not a map-drawing tool. Graphical user interfaces, map-drawing extensions, and querying by mouse are to be implemented by middleware.
Selection by pointing: the above applies.
Legend: SQL's responsibility is to retrieve data from the database in tabular form. Putting legends onto the data is not the SQL engine's responsibility but that of some visualization engine.
Label: you can explicitly store labels in a database if you want to. SQL should not return these labels unless asked to, for the reasons explained under "Context".
Selection of map scale: mapping is not the SQL engine's responsibility.
Area of interest: you can restrict your area of interest by using the WHERE clause.

3.7 The OGIS standard includes a set of topological spatial predicates. How should the standard be extended to include directional predicates such as East, North, North-East, and so forth? Note that the directional predicates may be fuzzy: where does North-East end and East begin?
By applying a distance operator to a directional neighborhood graph. Predicates such as East, North, and North-East would then determine the distance between two directional relations via shortest paths, and that distance would determine where one direction starts and another ends.

3.8 (pp. 76-78) Assumptions: T means that the intersection must be non-empty for all spatial types (e.g., in the case of TOUCH we have T on Bound(A) ∩ Bound(B); the intersection may be a point or a line, but that is not described here: T says only that the boundaries of two points, two lines, or two polygons must intersect, not HOW they intersect). Secondly, * denotes a value we do not care about (the intersection may or may not be empty). Logically, for some intersections we can infer that the objects MUST or MUST NOT intersect, and thus we do not need to set these "extraneous" intersections to T or F.
For example, in TOUCH we have True on Bnd(A) ∩ Bnd(B), Int(A) ∩ Ext(B), and Ext(A) ∩ Ext(B), and False on Int(A) ∩ Int(B). These rules imply that Int(A) ∩ Bnd(B) will be False, and hence we mark it *, since we did not need that fact to model the relation. The matrices for the general cases follow, with rows Int(A), Bnd(A), Ext(A) and columns Int(B), Bnd(B), Ext(B).
a.)
Touches:        Crosses:
F * T           T * T
* T *           * * *
T * *           * * *
b.) This matrix is the overlap operation where two lines have the same direction but different lengths (e.g., a river overlaps a tour-boat route).
c.)
Disjoint:       Contains:
F F *           T * *
F F *           * * *
* * *           F F *

Inside:         Equal:
T * F           T * F
* * F           * T *
* * *           F * T

Meet:           Covered By:
F * T           T * F
* T *           * T F
T * *           * * *

Covers:         Overlaps:
T * *           T * T
* T *           * * *
F F *           T * *

3.9 Express the following queries in SQL, using the OGIS extended data types and functions.
(a) List all cities in the City table which are within five thousand miles of Washington, D.C.
SELECT C1.Name, C1.Country, C1.Pop
FROM City C1, City C2
WHERE C2.Name = 'Washington DC'
  AND Distance(C2.Shape, C1.Shape) < 5000;
(b) What is the length of Rio Panamas in Argentina and Brazil?
SELECT R.Name, SUM(Length(Intersection(R.Shape, C.Shape)))
FROM River R, Country C
WHERE R.Name = 'Rio Panamas'
  AND Cross(R.Shape, C.Shape) = 1
  AND (C.Name = 'Argentina' OR C.Name = 'Brazil')
GROUP BY R.Name;
(c) Do Argentina and Brazil share a border?
SELECT Touch(Co1.Shape, Co2.Shape)
FROM Country Co1, Country Co2
WHERE Co1.Name = 'Argentina' AND Co2.Name = 'Brazil';
(d) List the countries that lie completely south of the equator.
SELECT Co.Name
FROM Country Co
WHERE 0 > MaxY(Co.Shape);
-- pseudo-SQL: the largest y-coordinate of the country's boundary lies south of the equator

3.10 (p. 78)
a) Names of all rivers that cross Itasca State Forest.
SELECT river.name
FROM river, forest
WHERE forest.name = 'Itasca'
  AND Cross(river.geometry, forest.geometry);
b) Names of all tar roads that intersect Francis Forest.
SELECT road.name
FROM road, forest
WHERE road.type = 'tar'
  AND forest.name = 'Francis'
  AND Intersect(road.geometry, forest.geometry);
c) Names of all roads with stretches within the floodplain of the river Montana.
SELECT road.name
FROM road, river
WHERE river.name = 'Montana'
  AND Intersect(road.geometry, river.flood-plain);
d) IDs of land parcels within 2 miles of Red River or 5 miles of Big Tree S.P.
SELECT land-parcel.id
FROM land-parcel, river, forest
WHERE (Intersect(land-parcel.geometry, Buffer(river.geometry, 2)) AND river.name = 'Red')
   OR (Intersect(land-parcel.geometry, Buffer(forest.geometry, 5)) AND forest.name = 'Big Tree');
e) Names of rivers that define part of the boundary of a county: not possible given the current schema. If a county table were present:
SELECT UNIQUE river.name
FROM river, county
WHERE Touch(river.geometry, county.geometry);
A county table could be built as a view from the land-parcel table if Union() took many arguments, but the table on page 66 defines it as taking only two.

3.11 Study compiler tools such as YACC (Yet Another Compiler Compiler). Develop a syntax scheme to generate SQL3 data-definition statements from an entity-relationship diagram annotated with pictograms.
The entity-relationship diagram (ERD) with pictograms provides a simple and clear way to model spatial and non-spatial data relationships. Pictograms can convey the spatial data types, the scale, and implicit relationships of spatial entities. These implicit relationships are topological, directional, and metric relationships. Pictograms are graphical representations of data-type objects, typically abstract data types. The objects can be basic shapes, multi-shapes, derived shapes, alternate shapes, and potentially a user-defined shape.
All objects are described by their unique graphic symbols. A set of rules allows pictograms to convey information about multi-shapes, derived shapes, and alternate shapes (i.e., shape representation and behavior change with scale). In an ERD that uses pictograms it is not necessary to explicitly depict the relationships between spatial objects; these relationships are object-specific and therefore implicit. SQL3 allows the user to define abstract data types much as classes are defined in many object-oriented languages (e.g., C++, Python, or Java): there is a data-type name (i.e., a class name) followed by member data types, then member functions. These member functions are used to reveal, alter, or perform operations on, or with, other objects. SQL3 also provides a form of inheritance; in this way a basic shape, like a point, can be used to construct more complex shapes such as lines or polygons. An ERD with embedded pictograms and SQL3 thus provide complementary tools to conceptualize objects and object relationships and to manipulate these data types in a spatial database. The challenge is to develop a set of rules (i.e., a grammar) to adequately transfer the pictogram representation of objects into valid SQL3 object declarations. Yet Another Compiler Compiler (YACC) is a tool that allows users to define a language (i.e., a set of rules) to structure program input. In addition to defining the grammar elements of their language, users can also specify functions that are invoked when certain rules are satisfied (e.g., a function is called when a token word or character is recognized in the input). A set of rules interpreted by a program such as YACC can bridge the gap between the graphical representation of objects in the pictogram and the SQL3 syntax. For instance, a set of rules could be written that recognizes the pictogram point specification:
input parameters: <pictogram> <shape> <basic shape> <point> <cardinality argument>;
The set of user-specified grammar rules and associated data members and member functions (e.g., distance()) could be used to translate the input parameters into SQL3 syntax. Example output in SQL3 form:
CREATE TYPE Point (
  x NUMBER,
  y NUMBER,
  FUNCTION Distance(:u Point, :v Point) RETURNS NUMBER);
In this manner, YACC-style rules could be defined for all OGIS data types and user-defined data types.

3.12 How would one model the following spatial relationships using the 9-intersection model or OGIS topological operations?
(a) A river (LineString) originates in a country (Polygon).
SELECT R.Name
FROM River R, Country C
WHERE Contains(C.Shape, R.Origin) = 1;
(b) A country is completely surrounded by another country.
SELECT C1.Name
FROM Country C1, Country C2
WHERE Within(C1.Shape, C2.Shape) = 1;
(c) A river falls into another river.
SELECT R1.Name, R2.Name
FROM River R1, River R2
WHERE Intersect(R1.Shape, R2.Shape) = 1
  AND R1.Name <> R2.Name;
(d) Forest stands partition a forest.
SELECT F.Name
FROM Forest F, Forest_Stand FS
WHERE Cross(FS.Shape, F.Shape) = 1;

3.13 Review the example RA queries provided for the state-park database in the Appendix. Write SQL expressions for each RA query.
Example query: find the name of the StatePark that contains the Lake with Lid number 100.
CREATE VIEW Lake100 AS
SELECT Pl.Sid FROM ParkLake Pl WHERE Pl.Lid = '100';

SELECT Sp.Sname
FROM Lake100 L, StatePark Sp
WHERE L.Sid = Sp.Sid;
1) Query: find the names of the StateParks with lakes where the Main-Catch is Trout.
CREATE VIEW TroutLake AS
SELECT La.Lid FROM Lake La WHERE La.Main-Catch = 'Trout';

SELECT Sp.Sname
FROM TroutLake T, ParkLake Pl, StatePark Sp
WHERE Pl.Lid = T.Lid AND Sp.Sid = Pl.Sid;
2) Query: find the Main-Catch of the lakes that are in Itasca State Park.
CREATE VIEW ItascaParkId AS
SELECT Sp.Sid FROM StatePark Sp WHERE Sp.Sname = 'Itasca';

SELECT La.Main-Catch
FROM ItascaParkId I, ParkLake Pl, Lake La
WHERE I.Sid = Pl.Sid AND La.Lid = Pl.Lid;

3.18 (p. 79) Revisit the relational schema for the state-park example in Section 2.2.3. Outline SQL DDL statements to create the relevant tables using OGIS spatial data types.
CREATE TABLE Forest-Stand (
  Stand-id    Integer,
  Species     varchar(30),
  Forest-name varchar(30),
  Shape       Polygon,
  PRIMARY KEY (Stand-id));
CREATE TABLE River (
  Name   varchar(30),
  Length Real,
  Shape  Line,
  PRIMARY KEY (Name));
CREATE TABLE Road (
  Name       varchar(30),
  NumofLanes Integer,
  Shape      Line,
  PRIMARY KEY (Name));
CREATE TABLE Facility (
  Name          varchar(30),
  Forest-name   varchar(30),
  Forest-name-2 varchar(30),
  Shape         Point,
  PRIMARY KEY (Name));
CREATE TABLE Forest (
  Name  varchar(30),
  Shape Polygon,
  PRIMARY KEY (Name));
CREATE TABLE Fire-Station (
  Name        varchar(30),
  Forest-name varchar(30),
  Shape       Point,
  PRIMARY KEY (Name));
CREATE TABLE Supplies_Water_To (
  FacName varchar(30),
  RivName varchar(30),
  Volume  Real,
  PRIMARY KEY (FacName, RivName));
CREATE TABLE Road-Access-Forest (
  RoadName varchar(30),
  ForName  varchar(30),
  PRIMARY KEY (RoadName, ForName));
CREATE TABLE Manager (
  Name    varchar(30),
  Age     Integer,
  Gender  varchar(30),
  ForName varchar(30),
  PRIMARY KEY (Name, Age, Gender));

3.19 Consider shape-based queries, for example "list countries shaped like a lady's boot" or "list squarish census blocks." Propose extensions to SQL3/OGIS to support such queries.
Shape-based queries, which belong to the family of content-based retrieval, have not yet been modeled by the OGIS data model. It is clearly a nontrivial task and is a topic of intense current research. Intuitively, supposing we already have the target shape clearly defined (such as "squarish census block" or "lady's boot"), we can simply compare the Query Shape with the Target Shape. However, we cannot do this directly with OGIS operators such as Overlay or Difference, for two reasons: 1) the Query Shape and the Target Shape may not be the same size, and 2) even if they are the same size or shape, they may have different orientations. So before comparing the Query Shape and the Target Shape we first need translation, rotation, and rescaling operations. A simple idea is as follows:
1) Find the long axis of both the Query Shape and the Target Shape. If their orientations differ, rotate the Query Shape to make its long axis match that of the Target Shape.
2) After the rotation, resize the Query Shape to make it approximately the same size as the Target Shape.
3) With the Query Shape and the Target Shape at the same size and orientation, use the OGIS operator Difference to find the portion of the Query Shape's geometry that does not match the Target Shape.
4) Define a threshold value and decide whether the Difference of the two shapes is small enough to declare them the same shape.
Current studies on shape-based retrieval have invented various algorithms for shape description and comparison.
Fourier descriptors (FD), curvature scale space (CSS) descriptors (CSSD), Zernike moment descriptors (ZMD), and grid descriptors (GD) are some examples. Fourier descriptors, for example, are obtained by applying the Fourier transform to shape boundaries. Both the Query Shape and the Target Shape are represented using a set of boundary coordinates. The centroid distance function is applied so that the representation is invariant to shape translation:

r_i = sqrt((x_i - x_c)^2 + (y_i - y_c)^2),  i = 1, 2, ..., L,

where x_c and y_c are the averages of the x and y coordinates, respectively. In order to apply the Fourier transform, all shapes in the database are normalized to the same number of boundary points, and then the discrete Fourier transform is applied. The coefficients calculated from the Fourier transform are called Fourier descriptors. FDs acquired this way are translation invariant due to the translation invariance of the centroid distance. To achieve rotation invariance, the phase information of the FDs is ignored and only the magnitudes |FD_n| are used. Scale invariance is achieved by dividing the magnitudes by |FD_0|. The similarity measure between the query shape and a target shape in the database is simply the Euclidean distance between their feature vectors.
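A minimal Java sketch of this pipeline, assuming the boundary has already been resampled to a fixed number of points; the naive O(n^2) DFT and all names here are illustrative:

class FourierDescriptor {
    // Centroid-distance signature followed by DFT magnitudes, normalized
    // by |FD_0| for scale invariance; phase is discarded for rotation
    // invariance, as described above.
    static double[] describe(double[] xs, double[] ys, int numCoeffs) {
        int n = xs.length;
        double xc = 0, yc = 0;
        for (int i = 0; i < n; i++) { xc += xs[i]; yc += ys[i]; }
        xc /= n; yc /= n;
        double[] r = new double[n];           // r_i = sqrt((x_i-x_c)^2 + (y_i-y_c)^2)
        for (int i = 0; i < n; i++) r[i] = Math.hypot(xs[i] - xc, ys[i] - yc);
        double[] fd = new double[numCoeffs];  // naive discrete Fourier transform
        for (int k = 0; k < numCoeffs; k++) {
            double re = 0, im = 0;
            for (int i = 0; i < n; i++) {
                double ang = -2 * Math.PI * k * i / n;
                re += r[i] * Math.cos(ang);
                im += r[i] * Math.sin(ang);
            }
            fd[k] = Math.hypot(re, im);       // keep the magnitude only
        }
        for (int k = numCoeffs - 1; k >= 0; k--) fd[k] /= fd[0];  // scale-normalize
        return fd;  // compare two shapes by Euclidean distance between these vectors
    }
}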
Chapter 4

4.3
a) An R-tree duplicates pointers to large objects across multiple nodes of the index; a search then follows multiple paths through the index, with duplicate elimination as post-processing. A grid file instead decomposes large objects into smaller fragments in its grid cells, which leads to additional post-processing to merge the fragments and reconstruct the object.
b) The strength of the scheme is that it is more convenient for handling large spatial objects, and it can speed up the search of a data file by indexing each group independently. The weakness is that as the size of the spatial objects grows, search becomes slower.

4.4 Compare and contrast the following terms:
a) Clustering vs. indexing. Clustering and indexing are both used to speed up DBMS data-retrieval performance; they differ in how they accomplish the improvement. Clustering attempts to organize the data on disk in such a way that it is easy for the DBMS to locate geographically close objects on disk. Indexing does not concern itself with how the records are stored on disk; rather, it maps key fields to disk pages, so for every distinct key in the database the DBMS can quickly look up the key field in the sorted index file and immediately know where on disk the object resides.
b) R-trees vs. grid files. Grid files are simple fixed grids laid over the space, dividing it into grid cells. A more advanced type is the non-uniform grid, in which, as the name suggests, the cells can have different sizes. Grid files rely on a grid directory, which simply keeps track of which bucket each data entry is located in. Grid files are efficient in I/O cost but require a large main memory for the grid directory. An R-tree represents spatial data in the form of a B-tree. As with B-trees, there are different kinds of R-trees (R+-trees, for example). In an R-tree each spatial item is approximated by a rectangle. The rectangles are then subdivided in the same manner as data in a B-tree, except that the rectangles are allowed to overlap; overlapping rectangles are dealt with in different ways depending on the type of R-tree used. In an R-tree, each rectangle can have only one parent node.
c) Hilbert curve vs. Z-curve. These are two clustering methods used in spatial databases to try to keep geographically close objects close to each other in the database. The Z-curve is named for the 'Z' pattern traced when sequential objects are accessed from the database. An object's Z-value is determined by interleaving the bits of its x and y coordinates and then translating the resulting number into a decimal value; this value is the object's position on the curve. Hilbert curves accomplish the same idea with a slightly different method: the Hilbert value of an object is found by interleaving the bits of its x and y coordinates as before and then chopping the binary string into 2-bit strings. Then, for every 2-bit string with value 0, we replace every 1 in the original string with a 3 and vice versa; if the value of the 2-bit string is 3, we replace all 2's and 0's in a similar fashion. After this is done, the 2-bit strings are put back together and the decimal value of the binary string is the Hilbert value of the object. With Z-curves there are instances where we move between objects that are not adjacent; Hilbert curves avoid this inefficiency. However, it is possible for objects that are geographically adjacent not to be adjacent along the Hilbert trace. Hilbert curves generally cluster better than Z-curves because they have no diagonal moves, but the Hilbert value costs more to compute than the Z-value (a bit-interleaving sketch of the Z-value computation appears after item d).
d) Primary index vs. secondary index. Both indices are two-field files that specify which disk page a particular record resides on. The difference is that with a primary index the database table is ordered on disk, so only the first record of each disk page is needed: when you search the primary index for a certain record you just need to find the page it is on, not its exact location, and since the file is ordered the index can be much smaller. With a secondary index the file is unordered, so every distinct key must have an index entry, because any row of the table can be on any disk page.
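A minimal Java sketch of the Z-value computation by bit interleaving; the 3-bit coordinate width matches the 8x8 grid of exercise 4.8 below, and the names are illustrative:

class ZCurve {
    // Interleave the bits of x and y, x bit first, so that
    // zValue(x, y) = x1 y1 x2 y2 ... xk yk as a binary number.
    static long zValue(int x, int y, int bits) {
        long z = 0;
        for (int i = bits - 1; i >= 0; i--) {
            z = (z << 1) | ((x >> i) & 1);   // next bit of x
            z = (z << 1) | ((y >> i) & 1);   // next bit of y
        }
        return z;
    }
    public static void main(String[] args) {
        // e.g. the point (100, 100) in 3-bit coordinates -> 110000 = 48
        System.out.println(zValue(0b100, 0b100, 3));   // prints 48
    }
}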
4.5 (p. 112) Which of the following properties are true for R-trees, grid files, and B-trees with Z-order?
Balanced: R-tree, B-tree with Z-order
Fixed depth: grid file
Non-overlapping: grid file, B-tree with Z-order
50 percent utilization per node: B-tree with Z-order

4.6
a) Create a final grid structure for the following data points. Assume each disk page can hold at most three records. Clearly show the dimension vector, directory grids, and data pages. The data points are: (0,0), (0,1), (1,1), (1,0), (0,2), (1,2), (2,2), (2,1), (2,0). The utilization rule is 50%.
b) Create an R-tree for the following data points. Assume each intermediate node may point to three other nodes or leaves, and each leaf may point to three data points. The utilization rule is 50%. The data points are: (0,0), (0,1), (1,1), (1,0), (0,2), (1,2), (2,2), (2,1), (2,0).

4.7 Draw the final grid-file structure generated by inserting records with the following sequence of point keys: (0,0), (1,1), (2,2), (3,3), (4,5), (5,5), (5,1), (5,2), (5,3), (5,4), (5,0). Assume each disk page can hold at most two records.

4.8 Repeat the above with a B-tree with Z-order.
(Diagram omitted: the objects A, B, C, and D plotted on an 8x8 grid.)

Object  Point  x    y    Interleave  Z-value
A       1      000  000  000000       0
A       2      001  001  000011       3
B       1      010  010  001100      12
B       2      011  011  001111      15
B       3      100  100  110000      48
C       1      101  101  110011      51
C       2      101  100  110010      50
C       3      101  011  100111      39
D       1      101  010  100110      38
D       2      101  001  100011      35
D       3      101  000  100001      33

4.9 Repeat the above question with an R-tree. (Diagram omitted: the eleven points plotted on a grid; each point can be assumed to represent an MBR.) We know that each index node holds three pointers and leaf nodes can hold only two records. Since we have 11 points, we split the area into six regions, labeled A through F, where X covers the area of A and B, Y covers the area of C and D, and Z covers the area of E and F. The R-tree representation is:

R
  X
    A: (4,5), (5,5)
    B: (5,4)
  Y
    C: (2,2), (3,3)
    D: (5,3), (5,2)
  Z
    E: (0,0), (1,1)
    F: (5,1), (5,0)

4.10 Consider a set of rectangles whose bottom-left corners are located at the set of points given in the previous question. Assume that each rectangle has a length of three units and a width of two units. Draw two possible R-trees. Assume an index node can hold at most three pointers to other index nodes, and a leaf index node can hold pointers to at most two data records. (Group 7 submitted this as hard copy.)

4.11 Compute the Hilbert function for the 4x4 grid shown in Figure 4.9 with the origin changed to the bottom-left corner.
Step 1: interleave the bits of x and y.
11 | 0101 0111 1101 1111
10 | 0100 0110 1100 1110
01 | 0001 0011 1001 1011
00 | 0000 0010 1000 1010
     00   01   10   11
Step 2: group the bits into base-4 digits.
11 | 11 12 21 22
10 | 10 13 20 23
01 | 01 02 31 32
00 | 00 03 30 33
     00 01 10 11
Step 3: apply the Hilbert replacement rules.
11 | 11 12 21 22
10 | 10 13 20 23
01 | 03 02 31 30
00 | 00 01 32 33
     00 01 10 11
Step 4: convert to decimal Hilbert values.
11 |  5  6  9 10
10 |  4  7  8 11
01 |  3  2 13 12
00 |  0  1 14 15
     00 01 10 11

4.12 What is an R-link tree? What is it used for in spatial databases?
R-link trees are a modified R-tree indexing technique. They are the spatial analogue of the B-link tree, developed to handle concurrent access to the underlying data. Basically, the leaf nodes are connected in a linked list. Nodes are assigned logical sequence numbers, with each subsequent number being larger than the previous ones. When a node is split, the old number goes to the new node, while a new number is assigned to the older node. Effectively, the linked list runs from left to right and the assigned numbers should decrease. Should a split occur that has not yet been detected by the parent level, traversal can reveal this by finding a node that is out of order, since no parent node would currently point to it.

4.13 (Group 9 did not answer this.)

4.14 What is a join index? List a few applications where a join index may be useful.
A join index is a data structure used for processing join queries in databases. A join index describes a relationship between the objects of two relations. Example: assume that each tuple of a relation has a surrogate (a system-defined identifier for tuples, pages, etc.) that uniquely identifies that tuple. A join index is a sequence of pairs of surrogates in which each pair identifies a result tuple of a join; the tuples participating in the join result are given by their surrogates. Let R and S be two relations, and consider the join of R and S on attribute A of R and attribute B of S.
4.12 What is an R-Link tree? What is it used for in spatial databases?
R-Link trees are a modified R-tree indexing technique. They are the spatial analogue of the B-Link tree, developed to handle concurrent access to the underlying data. Basically, the leaf nodes are connected in a linked list. Nodes are assigned a logical sequence number, with each subsequent number being larger than those previous. When a node is split, the old number goes to the new node, while a new number is assigned to the older node. Effectively, the linked list runs from left to right, and the assigned numbers should decrease. Should a split occur and not yet be detected by the parent level, traversal can reveal this by finding a node that is out of order, since no parent node would currently point to it.

4.13 (Group 9 did not answer this.)

4.14 What is a join-index? List a few applications where a join-index may be useful.
A join-index is a data structure used for processing join queries in databases. A join-index describes a relationship between the objects of two relations. Example: assume that each tuple of a relation has a surrogate (a system-defined identifier for tuples, pages, etc.) that uniquely identifies that tuple. A join-index is a sequence of pairs of surrogates in which each pair identifies a result-tuple of a join; the tuples participating in the join result are given by their surrogates. Let R and S be two relations, and consider the join of R and S on attribute A of R and attribute B of S. The formal definition of the join-index can be expressed as:

JI = { (ri, sj) | F(ri.A, sj.B) is true, for ri ∈ R and sj ∈ S }

Join-indices use pre-computation techniques to speed up online query processing and are useful for datasets that are updated infrequently. Applications where a join-index might be used:
1) In the StatePark example, since information about StatePark and Lake is usually stable and needs to be updated infrequently, we can create a join-index between StatePark and Lake and pre-compute the join result, so that queries on them are processed much faster.
2) To study the relationship between HazardousSite and the Species in the ForestStand, a join-index can also be created between the ForestStand and BufferOfHazardousSite tables.

4.17 a. Nodes b and c
b. Nodes a, c, d, e, f

4.18 a) For a point query, we would need to pull rectangle 5 from nodes b and c in intermediate node x.
b) The range query performs the same as in 4.17: nodes a, c, and d within intermediate node x and nodes e and f in intermediate node y.

4.20 Compute Hilbert values for the pixels in objects A, B, and C in Figure 4.7. Compare the computational cost of range searches for objects B and C using Hilbert-curve-ordered storage.

x   y   interleaved  digits  updated digits  updated binary  Hilbert value
00  00  0000         00      00              0000            0
01  00  0010         03      01              0001            1
10  00  1000         30      32              1110            14
11  00  1010         33      33              1111            15
00  01  0001         01      03              0011            3
01  01  0011         02      02              0010            2
10  01  1001         31      31              1101            13
11  01  1011         32      30              1100            12
00  10  0100         10      10              0100            4
01  10  0110         13      13              0111            7
10  10  1100         20      20              1000            8
11  10  1110         23      23              1011            11
00  11  0101         11      11              0101            5
01  11  0111         12      12              0110            6
10  11  1101         21      21              1001            9
11  11  1111         22      22              1010            10

Assuming that the query is to retrieve all objects in the regions of A, B, and C, and furthermore assuming that the objects are stored in a B+-tree ordered in Hilbert order with 16 leaves and an effective fanout of f, the strategy for executing the query is as follows.
1. Convert the regions into Hilbert values: A = {5}, B = {8, 9, 10, 11}, C = {1, 14}.
2. Find the maximal contiguous runs of Hilbert values: A = {{5}}, B = {{8, 9, 10, 11}}, C = {{1}, {14}}. Within a contiguous run it is enough to follow the links between the leaves; for each separate run we must descend the tree again.
3. Accordingly, we retrieve the objects by descending from the root to 1 and retrieving all objects from 1; then descending from the root to 5 and retrieving all objects from 5; next descending from the root to 8 and moving on through 9, 10, and 11 via the leaf links, retrieving their objects; and finally descending from the root to 14 and retrieving all objects from 14.

[Figure 1. Left: the Hilbert values for Figure 4.7 in the SDB book. Right: the range queries; the arrows indicate the leaves at which the tree has to be searched from the root, while the other leaves can be accessed by following the links between adjacent nodes in the B+-tree.]

The cost of this formulation is 4 * log_f(16) disk pages (one root-to-leaf descent per contiguous run).

4.22 Since this problem is exactly the same as Question 4 in the mid-term exam, I will skip it.

Chapter 5

5.3 What is special about query optimization in SDBs relative to traditional relational databases?
The three major differences between spatial and traditional (non-spatial) databases are:
1. Lack of a fixed set of operators
2. Lack of a natural spatial ordering
3. Evaluation of spatial predicates is more expensive
The consequences of these differences are that (a) the system does not know the cost of user-defined operators, (b) due to the lack of natural ordering, new access methods are needed, and (c) the premise that I/O cost is the dominant cost (relative to the CPU cost) may not hold. Optimizers work around these problems by (i) introducing the filter-and-refine paradigm and (ii) developing new access methods.

The filter-and-refine paradigm. Instead of handling the potentially complex objects directly, the SDB engine stores and processes the minimum bounding rectangles (MBRs) of these objects. In the filter step, which is performed by the engine itself, only the MBRs are considered, with a restricted set of operators [overlap, join, nearest-neighbor]; the evaluation of the actual predicate on the complex objects themselves is performed in the refine step, responsibility for which is usually delegated back to the application. This paradigm solves (a) by using only a limited set of spatial operators [usually overlap], for which the cost is known and reasonably low when applied to MBRs. This also solves (c).

New spatial access methods. Either (i) an ordering is forced onto the multi-dimensional space [such as a Hilbert curve], or (ii) inherently multi-dimensional indices are used [e.g., the R-tree].

5.4 a. With no R-tree indices available, we are forced to use the brute-force nested-loop join strategy, which will compare all possible joined tuples.
b. When one relation has an index, we can use the nested loop with index. By using the relation with the index as the inner loop of the nested loop, we do not need to scan the whole relation on each iteration; we can reduce it to a range scan of the index.
c. If both relations have an index, we can take advantage of the tree-matching strategy. By using the R-tree indices for each relation, we can achieve a lower bound on the I/O cost of IR1 + IR2, where IRn is the number of data pages for relation Rn.

5.5 Which join strategies can be used for spatial join, with join predicates other than overlap?
Nested loop: check every possible pair.
Tree matching: use R-tree indices on both relations (if available).
Space partitioning: objects are compared only if they share a region.

5.6 Tree transformation: draw query trees for the following queries and give the SQL equivalents:
a) Which city in the City table is closest to a river in the River table (Section 3.5)?
b) List countries whose GDP is greater than that of Canada.
c) List countries that have at least as many lakes/neighbors as France.

5.7 Apply the tree-transform rules to generate a few equivalent query trees for each query tree generated above [in problem 5.6]. (Group 3 did not answer this question.)

5.8 Consider a composite query with a spatial join and two spatial selections. This strategy, however, does not allow the spatial join to take advantage of the original indexes on lakes and facilities; in addition, the writing and reading of file 1 and file 2 add to the cost. Devise an alternate strategy to simultaneously process the join and the two selections underneath it.
The query has two spatial selections and one spatial join, and both files have indexes. If spatial indexes are available on both relations, the tree-matching strategy can be used. If IR1 and IR2 are the numbers of pages occupied by the nodes of the index trees for relations R and S respectively, the minimum I/O cost is IR1 + IR2.
The total cost is this I/O cost plus the join cost:

IR1 + IR2 + (js * |R| * |S|) / bfrRS

Notation:
js: join selectivity
|R|: number of records in R
|S|: number of records in S
bfrRS: blocking factor of the join result

5.9 We focused on strategies for point selection, range-query selection, and spatial join for overlap predicates in this chapter. Are these adequate to support processing of spatial queries in SQL3/OGIS?
They provide the basics, but as mentioned in Chapter 3, the operations are limited to simple select-project-join queries. Support for directional predicates is missing, spatial aggregate queries pose problems, and support is also needed for shape-based and visibility-based operations.

5.10 How can we process various topological predicates (e.g., inside, outside, touch, cross) in terms of the overlap predicate as a filter, followed by an exact geometric operation?
For this, we can use MBRs, since they provide an affordable way to approximate geometry, giving a reliable transformation from an MBR relation to the spatial region contained within. By applying the overlap operation to the MBRs, and judging by Figure 2.3 on page 30, a failed overlap test eliminates both disjoint and meet from consideration. This essentially saves time by removing clearly disjoint objects before advancing to the exact geometry test for the remaining six operations.
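A minimal sketch of this filter step (my representation assumption: an MBR is a tuple (xmin, ymin, xmax, ymax); candidates are produced by a simple nested loop, as in strategy 5.4a):

    def mbr_overlap(a, b):
        # MBRs overlap unless one lies strictly to the left of,
        # or strictly below, the other.
        return not (a[2] < b[0] or b[2] < a[0] or
                    a[3] < b[1] or b[3] < a[1])

    def filter_step(mbrs_r, mbrs_s):
        # Pairs surviving the MBR filter; only these go on to the exact
        # (expensive) geometric test of the refinement step.
        return [(i, j) for i, r in enumerate(mbrs_r)
                       for j, s in enumerate(mbrs_s)
                       if mbr_overlap(r, s)]

A failed MBR-overlap test proves the geometries are disjoint, so the refinement step only ever sees the surviving pairs.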
5.11 Design an efficient single-pass algorithm for nearest-neighbor query using one of the following storage methods: grid file, B+-tree with Z-order.
With a grid file, the search starts by locating the grid-directory cell that contains the query point and examining its bucket (grid-directory entries may share a bucket, meaning that the points or records belonging to those regions are stored in the same bucket). The search then expands to surrounding cells, using search-pruning strategies to skip buckets that cannot contain a point closer than the best one found so far.

5.12 (Group 8 did not answer this.)

5.13 Compare the computational cost of the two-pass and one-pass algorithms for nearest-neighbor queries. When is the one-pass preferred?
The cost of the two-pass algorithm is clearly the cost of a point search added to the cost of a range query. The cost of a point search is O(log_B n), where B is the blocking factor. The cost of the range query is O(log_B n) + (query size / clustering efficiency). Therefore the entire cost would be approximately 2 * log_B n + (query size / clustering efficiency). The cost of the one-pass algorithm is the cost of a tree traversal: O(log n). One-pass would be preferable when the blocking factor is high and the data set is sufficiently sparse.

5.14 Compare and contrast (i) parallel vs. distributed databases.
Parallel database management systems utilize parallel processor technology employing one of three main architectures, namely (a) shared-memory, (b) shared-disk, and (c) shared-nothing. A distributed database is a collection of multiple, logically interrelated databases distributed over a network; here the architecture could be a client-server system or a collaborating system. Although the shared-nothing architecture in parallel databases resembles a distributed database system, there are differences in the mode of operation: in parallel systems there is symmetry and homogeneity of nodes, whereas in distributed systems heterogeneity of hardware and operating system at each node is common.

(ii) declustering vs. dynamic load balancing
Declustering refers to dividing the set of data items in query processing among various disks such that response time is minimized. Depending on when data is divided and allocated, declustering methods fall into two categories: static load balancing, where partitioning is done before computation, and dynamic load balancing, where data is partitioned at run time. Most often, both static declustering and dynamic load balancing are required because of highly non-uniform data distributions and large variations in the size and extent of spatial data.

(iii) shared-nothing vs. shared-disk architectures
In the shared-nothing (SN) architecture, each processor is associated with a memory and some disk units that can be accessed only by that processor. In the shared-disk (SD) architecture, each processor has a private memory that can be accessed only by that processor, but all processors in the system can access all the disks in the system. The SN architecture minimizes interference among different processors and hence improves scalability, but load balancing can be a greater challenge than in the other two cases; also, data on a disk can become unavailable if a processor fails. The SD architecture is halfway between SN and shared-memory (SM) as far as resource sharing goes; hence it is more scalable than SM, and load balancing is easier than in SN. Communication overhead is less than in SN, and synchronization is easier.

(iv) client-server vs. collaborative systems
In a client-server system there are one or more client processes and one or more server processes, and a client process can send a query to any server process. Clients are responsible for the user interface, while the server manages the data and executes transactions. A collaborative system has a collection of servers, each of which can run transactions on local data; they cooperatively execute transactions that span several servers. A client-server system can be implemented easily; it provides for better utilization of expensive servers, and each module can be optimized, thus reducing the volume of transferred data. If the functionality of the server needs to overlap that of the client, a client-server system needs more complex clients, whereas this is easily handled in collaborative systems, since there is no distinction between the two.

5.15 What is GML? Why is it interesting to spatial databases and GIS?
GML is the Geography Markup Language. It is a modeling language for geographic information and an encoding for geographic information. It is designed for the web and web-based services, and it is an OpenGIS implementation specification. GML is based on XML technologies and implements concepts of the ISO 19100 series. It supports spatial and non-spatial properties of objects, and it is extensible, open, and vendor-neutral. Characteristics of GML:
1) It supports the description of geospatial application schemas for information communities.
2) It enables the creation and maintenance of linked geographic application schemas and datasets.
3) It supports the transport and storage of application schemas and data sets.
4) It increases the ability of organizations to share geographic application schemas and the information they describe.
5) It leaves it to implementers to decide whether application schemas and datasets are stored in native GML or whether GML is used only for schema and data transport.
GML is interesting to spatial databases and GIS because it provides support for the geometry elements corresponding to Point, LineString, LinearRing, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. It also provides the coordinates element for encoding coordinates and the Box element for defining spatial extents.
We can use the GML representation to build truly interoperable, distributed GIS.

5.18 Assume that the system response time is the slowest response time of a single processor. Therefore, if one disk's I/O is t and another disk's I/O is 2t, the system I/O is 2t.
(1) The linear and CMD methods distribute the 8 data items in one row onto 8 different disks, so the row-query speed-up is 8 using the linear and CMD methods. The Z-curve method distributes the 8 data items in one row onto 4 different disks with 2 items per disk, so the row-query speed-up is 4 using the Z-curve method. The Hilbert method distributes the 8 data items in one row onto 7 or 8 different disks, so the row-query speed-up is 4 or 8 using the Hilbert method. The I/O speed-up for the lowest row is 4.
(2) The linear and CMD methods distribute the 8 data items in one column onto 8 different disks, so the column-query speed-up is 8 using the linear and CMD methods. The Z-curve method distributes the 8 data items in one column onto 2 different disks with 4 items per disk, so the column-query speed-up is 2 using the Z-curve method. The Hilbert method distributes the 8 data items in one column onto 5 or 6 different disks; sometimes 2 items are on one disk, but sometimes 3 items are on one disk, so the column-query speed-up is 4 or 8/3 using the Hilbert method. The I/O speed-up for the leftmost column is 4.
(3) The speed-up for a range query depends on the query range. In the bottom 4 rows of the 2 leftmost columns, there are at most 2 cells on one disk for the linear method, so the speed-up using the linear method is 4. There are at most 2 cells on one disk for the CMD method, so the speed-up using the CMD method is 4. There are at most 2 cells on one disk for the Z-curve method, so the speed-up using the Z-curve method is 4. There is at most 1 cell on one disk for the Hilbert method, so the speed-up using the Hilbert method is 8.

Table 1. I/O speed-ups for queries 1, 2, and 3 in problem 5.18:

          1. Row query   2. Column query   3. Range query
Linear         8                8                 4
CMD            8                8                 4
Z-curve        4                2                 4
Hilbert        4                4                 8

Chapter 6

6.1 A unique aspect of spatial graphs is the Euclidean space in which they are embedded, which provides directional and metric properties. Thus nodes and edges have relationships such as left-of, right-of, north-of, and so on. An example is the restriction "no left turn" at an intersection in a downtown area. Consider the problem of modeling road maps using the conventional graph model consisting of nodes and edges. A possible model may designate the road intersections as the nodes, with the road segments joining the intersections as the edges. Unfortunately, turn restrictions (e.g., No Left Turn) are hard to model in this context: a routing algorithm (e.g., A*, Dijkstra) will consider all neighbors of a node and would not be able to observe the turn restriction. One way to solve this problem is to add attributes to nodes and edges and modify routing algorithms to pay attention to those attributes. Another way is to redefine the nodes and edges so that the turn restrictions are modeled within the graph semantics and the routing algorithms are not modified. Propose alternative definitions of nodes and edges so that routing algorithms are not modified.
One way to resolve this issue would be to assign very high costs: if going from node 1 to node 2 is undesirable (no left turn, for example), then the cost assigned to 1->2 can be made prohibitively high to prevent the algorithm from choosing that move. Another approach would be to create a list of conflicting node pairs: if a pair of nodes is in the conflicting list, the algorithm will not choose it.
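A sketch of the redefinition the question asks for (the segment list and restriction here are illustrative, not from the textbook): make each directed road segment a node of a new graph, and connect segment (u, v) to segment (v, w) only when the turn u -> v -> w is permitted. An unmodified routing algorithm run on this graph then honors the turn restrictions automatically.

    # Directed road segments become the nodes of the new graph.
    segments = [("a", "b"), ("b", "c"), ("b", "d")]
    no_turn = {(("a", "b"), ("b", "d"))}      # e.g., no left turn a -> b -> d

    turn_graph = {seg: [] for seg in segments}
    for u, v in segments:
        for v2, w in segments:
            if v == v2 and ((u, v), (v2, w)) not in no_turn:
                turn_graph[(u, v)].append((v2, w))

    # turn_graph[("a", "b")] == [("b", "c")]; the restricted turn is absent.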
6.2 Study internet sites (e.g., mapquest.com) providing geographic services (e.g., routing). List spatial data types and operations needed to implement these geographic information services.

Answer 1: For this question, I surveyed MapQuest and Yahoo Maps. These services use graph data types (i.e., graph, node, and edge), which are commonly used for network structures. The services offered by these sites focus on network-analysis queries, with specific emphasis on shortest-path queries. I recognized two main variants of the shortest-path query: minimum distance and fastest path. Both might use the best-first search algorithm (or one modified by edge weights). The minimum-distance query returns the absolute shortest network distance. The fastest-path query uses a weighted network; the weights might be assigned based on either speed limit or an edge (i.e., road) classification scheme (e.g., primary four-lane, primary two-lane, secondary, all the way down to jeep trail). The shortest-path query with only the minimum-distance constraint probably uses a single-pair path computation, which returns only the route with the minimum distance between the start and stop vertices, whereas the fastest-path search might select the top k shortest paths and then consider the edge-specific weights to find the optimal path between the start and stop vertices. This select-and-refine approach is similar to the minimum-bounding-box approach. MapQuest also offered a scenic route to the specified destination. This type of query could also fall under shortest path; however, the algorithm used to select the scenic route finds a minimized path that touches the most predefined "scenic" vertices. This query would be performed with the constraint that the scenic path be within a reasonable detour of the shortest or fastest path. Regardless of the type of query selected, the response is a set of directions that summarizes travel directions. Compiling these directions in a way that is useful to the user is not a trivial task: the directions must note edge name changes, direction changes at vertices that are important to the navigator, and landmark points that might not be part of the original graph structure. This last operation requires a join query, which might couple a point or polygon layer with the network layer and employ some of the standard OGIS operations.

Answer 2: Internet sites providing geographic services (e.g., mapquest.com) offer the following typical services; for each, the corresponding data types and operations are listed.
Street map, geocoding. Data types: map data (e.g., shape files). Operations: point query; R-tree or grid access; OGIS operations (e.g., SpatialReference(), Boundary()).
Directions (routing). Data types: spatial network (e.g., vertex, edge, etc.). Operations: path query; graph partitioning; graph management operations (e.g., addEdge, getSuccessors).
Proximity searching. Data types: map data; spatial network. Operations: range query; nearest-neighbor search; R-tree or grid access; distance scan; graph partitioning.

6.5 We learned about graph operators. Is this operator relevant to spatial databases with point, linear, or polygonal objects? Consider the GIS operation of reclassify: for example, given an elevation map, produce a map of mountainous regions (elevation > 600 m). Note that adjacent polygons within a mountain should be merged to remove common boundaries.
Graph operators are very relevant to spatial databases.
Where it gets difficult in implementation, however, is that there are three different data types within these databases: points, lines, and polygons. Graph operators by definition are given by edges and vertices, in which the edges are the connections between the vertices. Thus, it would seem only logical that graph operators would work well in spatial databases for certain data types. It almost seems as if these operators were custom-designed for the combination of point and line data, with the points representing the vertices and the lines representing the edges. A good example of this would be a road network, where the edges would be the different road segments and the points would be important stops along that network, such as an intersection with another edge. However, trying to apply these operators to more complex polygon data seems to be more difficult. Because polygons are composed of line segments, which are in turn composed of points, it might be hard to employ such operators on a polygon dataset. If one were to apply polygon datasets to graph operators as-is, one would assume that the polygons would take on the role of edges while their line segments make up the vertices. But how can one define a polygon solely by the two vertices it (the edge) falls between? It does not seem possible to effectively employ graph operators on polygons as-is. However, if we define elevation as being continuous across the polygon itself, it is possible to represent the elevation as a single point within the polygon, such as its geographic center. We would lose some of the geographic extent of the data with such a conversion, but implementing this change would fit the data better to the graph operators. We could use the vertices to represent these elevation points while the edges represent the generalized slopes between them. This would in effect create something very similar to what GIS users know as the triangulated irregular network (TIN). (I admit that taking this approach to the problem may demonstrate my lack of experience with these sorts of operators before this class; my mind is trained to look at things more geographically, which is why I see point elevations and slopes when I see a drawing of graph operators along a network.)

[Figure: an example map. Left: points labeled 620m, 540m, and 670m marking the geographic center of elevation of each polygon, assuming elevation is uniform across the polygon. Right: the same points as the vertices of G(V, E), with slope values (80, 50, 130) on the edges; the edges assume a distance of 1 unit between points.]

6.6 Given a polygonal map of the political boundaries of the countries of the world, classify the following queries into the classes transitive closure, shortest path, or other:
a) Find the minimum number of countries to be traveled through in going from Greece to China. Shortest path.
b) Find all countries with a land-based route to the US. Other.

6.7 Most of the discussion in this chapter focused on graph operations (e.g., transitive closure, shortest path) on spatial networks. Are OGIS topological and metric operations relevant to spatial networks? List a few point and OGIS topological and metric operations.
Intersect and Distance can be used to find the shortest path.
Intersect, Touch, and Cross can be used to find the transitive closure.

6.8 Propose a suitable graph representation of the River Network to precisely answer the following query: Which rivers can be polluted if there is a spill in a specific tributary?
We should consider river segments instead of whole rivers. The textbook's initial graph representation of the River Network models the rivers as nodes and the "falls into another" relationship as directed edges (pages 151 and 157). This graph, however, will not help us answer the above query; instead, we must change the graph to make the confluences, heads, and mouths of rivers into nodes and the river segments into edges. Thus, if a given river segment is polluted, we can trace downstream to identify segments that may get polluted. Transforming this graph into an appropriate graph yields a result similar to the BART graph (page 156). The BART RouteStop table has columns 'routenumber', 'stopid', and 'rank'. In the new River Network graph, riverID is equivalent to 'routenumber' and confluenceID is equivalent to 'stopid'. We can also rank the confluenceIDs for a given riverID to make a rank column. The River table stays the same for this new graph representation. Instead of writing out the entire RiverConfluence table, we draw the graph representing the nodes and directed edges (Figure 1).

River table:
riverID  Name
1        Mississippi
2        Ohio
3        Missouri
4        Red
5        Arkansas
6        Platte
7        Yellowstone
8        P1
9        P2
10       Y1
11       Y2
12       Colorado
13       Green
14       Gila
15       G1
16       G2
17       Gl1
18       Gl2

[Figure 1. New graph representation of the River Network, with confluences, heads, and mouths numbered 1-32 as nodes and river segments as directed edges.]

RiverIntersection table (RiverID -> IntersectionIDs, as listed):
1 -> 6, 5, 4, 3, 2, 1
2 -> 9, 3, 2, 1
3 -> 12, 11, 10, 5, 4, 3, 2, 1
4 -> 7, 2, 1
5 -> 8, 3, 2, 1
6 -> 13, 10, 5, 4, 3, 2, 1
7 -> 14, 11, 10, 5, 4, 3, 2, 1
This table shows the river network through the Yellowstone River; a complete table for all river segments would be very long.

6.9 Extend the set of pictograms discussed in Chapter 2 for conceptual modeling of spatial networks.
Differences between the ER model and a spatial network: in a spatial network each node represents one object, but in the ER model each node (entity) represents a set of instances (records); and the relationships (connections) in a spatial network are directed, whereas there is no direction between entities in the ER model. The nodes and connections can be represented using pictograms. [Pictograms shown for a bidirectional graph, a unidirectional graph, and a node.] To represent a node we use a hollow circle, since the filled circle is already used to represent a point value in the ER model.

6.10 Produce a denormalized node table similar to Figure 6.2(e) for the River Network example.

riverID  fallsInto  mergeOf
1        NULL       (2,3,4,5)
2        1          -
3        1          (6,7)
4        1          -
5        1          -
6        3          (8,9)
7        3          (10,11)
8        6          -
9        6          -
10       7          -
11       7          -
12       NULL       (13,14)
13       12         (15,16)
14       12         (17,18)
15       13         -
16       13         -
17       14         -
18       14         -
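A sketch of how a table like this answers 6.8's spill query (at river granularity here; the segment-level graph proposed in 6.8 would be traversed the same way): starting from the polluted tributary, repeatedly follow the fallsInto pointers downstream.

    # fallsInto column of the denormalized table above (1 and 12 are mouths).
    falls_into = {2: 1, 3: 1, 4: 1, 5: 1, 6: 3, 7: 3, 8: 6, 9: 6,
                  10: 7, 11: 7, 13: 12, 14: 12, 15: 13, 16: 13,
                  17: 14, 18: 14}

    def polluted_by_spill(river_id):
        downstream = []
        while river_id in falls_into:
            river_id = falls_into[river_id]
            downstream.append(river_id)
        return downstream

    # polluted_by_spill(10) == [7, 3, 1]
    # (a spill in Y1 reaches the Yellowstone, Missouri, and Mississippi).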
6.11 Extend the logical data model for spatial networks by adding a sub-class "Path" of the class "Graph". List a few useful operations and application domains. List a few other interesting sub-classes of "Graph".
If we extended the Graph class with a Path class, we would be able to store multiple paths over a single network, not just the shortest path. In the bus example, we could keep one larger graph instead of having to create and store multiple sub-graphs, each signifying a single route. For this, each entry in the Path class would be independent and unconnected from the others. The above example makes some assumptions about how the graph is stored and what it represents. Perhaps a better example would be a road map: the Path class helps in case the optimal shortest path has been altered, possibly by construction work, since by using path entries we can quickly retrieve alternate routes. If we use the model as in the first example, our graph shows nodes from other routes along with a singular path; then we might need an extension to show which nodes belong to which path and whether the routes they belong to intersect (in general, the overlapping of paths).

6.12 Turn restrictions are often attached to road networks; for example, a left turn from a residential road onto an intersecting highway is often not allowed. Propose a graph model for road networks to model turn restrictions. Identify nodes and edges in the graph. How will one compute the shortest path honoring turn restrictions?
The road network model has the turn restrictions attached. There are five lines, and ten routes cross through the road network; all the lines converge at one point, such as downtown, and then spread out in sporadic directions. (The segment-as-node redefinition sketched after 6.1 above is one way to encode the restrictions in the graph itself.) Use Dijkstra's algorithm to find the shortest path; it solves the single-source problem, visiting an immediate neighbor of a source node and its successive neighbors recursively before visiting other immediate neighbors.

6.13 (Group 8 did not answer this.)

6.14 Compare and contrast Dijkstra's algorithm with the best-first algorithm for computing the shortest path.
Dijkstra's algorithm finds the shortest paths connecting all nodes reachable from a specified source, while best-first finds the shortest path leading from a specified source to a specified end point. Dijkstra's algorithm always gives (one of) the best spanning tree(s), while best-first is heuristic: it usually gives the shortest path from A to B, but not always. Dijkstra's algorithm links all nodes; best-first doesn't always link all nodes, only those it passes through while traversing from A to B.

6.15 What is a hierarchical algorithm for the shortest-path problem? When should it be used?
A hierarchical algorithm partitions a large graph into fragment graphs and constructs a boundary graph for these fragments; these graphs are much smaller than the original graph. The process continues on the boundary graphs until a graph of reasonable size (for example, one that fits in main memory) is obtained. Proper construction of the boundary graph enables the shortest-path query on the original graph to be decomposed into a set of shortest-path queries on fragments without compromising the optimality of the retrieved path. The algorithm consists of three basic steps:
1. Computing the relevant node pairs in the boundary graph, by finding the boundary-graph nodes of the fragments of the source and destination nodes.
2. Computing the boundary path, by finding the shortest path from the source node's fragment to the destination node's fragment.
3. Expanding the boundary path, by expanding the path within each fragment.
A hierarchical algorithm can be used when the graph is too large to fit in main memory; it reduces the main-memory requirements and the I/O costs.

6.16 Compare and contrast adjacency-list and adjacency-matrix representations for graphs. Create adjacency-list and adjacency-matrix representations for the following graphs. (An adjacency list stores, for each node, the list of its successors, using O(V + E) space, which favors sparse graphs; an adjacency matrix stores a V x V matrix of 0/1 entries, using O(V^2) space but allowing O(1) edge lookups.)
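For concreteness, a small sketch building both representations for graph G of part (i) below, from its edge list:

    edges = [(1, 2), (1, 5), (2, 3), (3, 4), (5, 3)]
    nodes = [1, 2, 3, 4, 5]

    # Adjacency list: O(V + E) space, compact for sparse graphs.
    adj_list = {v: [] for v in nodes}
    for u, v in edges:
        adj_list[u].append(v)

    # Adjacency matrix: O(V^2) space, but O(1) edge lookup.
    adj_matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        adj_matrix[u - 1][v - 1] = 1

    # adj_list[1] == [2, 5]; adj_matrix[0] == [0, 1, 0, 0, 1]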
(i) Graph G in Fig. 6.4
Adjacency-list representation:
1 -> 2, 5
2 -> 3
3 -> 4
4 -> NULL
5 -> 3

Adjacency-matrix representation:
    1  2  3  4  5
1   0  1  0  0  1
2   0  0  1  0  0
3   0  0  0  1  0
4   0  0  0  0  0
5   0  0  1  0  0

(ii) Graph G* in Fig. 6.4
Adjacency-list representation:
1 -> 2, 3, 4, 5
2 -> 3, 4
3 -> 4
4 -> NULL
5 -> 3, 4

Adjacency-matrix representation:
    1  2  3  4  5
1   0  1  1  1  1
2   0  0  1  1  0
3   0  0  0  1  0
4   0  0  0  0  0
5   0  0  1  1  0

(iii) River network in Figure 6.3
Adjacency-list representation:
1 -> NULL
2 -> 1
3 -> 1
4 -> 1
5 -> 1
6 -> 3
7 -> 3
8 -> 6
9 -> 6
10 -> 7
11 -> 7
12 -> NULL
13 -> 12
14 -> 12
15 -> 13
16 -> 13
17 -> 14
18 -> 14

Adjacency-matrix representation (an 18 x 18 0/1 matrix; row i has a single 1 in the column of the river that river i falls into):
     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 1   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 2   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 3   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 4   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 5   1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 6   0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 7   0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 8   0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
 9   0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
10   0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
11   0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
12   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
13   0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
14   0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
15   0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
16   0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
17   0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
18   0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0

6.19 Revisit the map-reclassify operation discussed in Section 2.1.5. Discuss whether SQL3 recursion and OGIS operations may be used to implement it.
It may be possible to apply SQL3 recursion to spatial features that are recursive by definition. Contains, inside, covers, and covered-by may use SQL recursion to find object relations, since these operations can progressively deepen. In the example given in 2.1.5, a merge operation is used to remove boundaries; this is very similar to recursion, so SQL3 recursion may be used for this type of operation.

6.20
Breadth-first search. Concept: the BFS algorithm visits all nodes that are reachable from v. Benefits: if the tuples of the edge relation are physically clustered according to the value of their source nodes, the majority of the members of an adjacency list are likely to be found on the same disk page. Complexity: O(V + E).
Dijkstra's algorithm. Concept: a solution to the single-source shortest-path problem. Benefits: it can be applied to a weighted, directed or undirected graph when all edge weights are nonnegative. Complexity: O(n^2).
Best-first search. Concept: uses a heuristic function to underestimate the cost between two nodes. Benefits: best-first search has been a framework for speeding up other algorithms by using semantic information about a domain. Complexity: depends on the heuristic function.
Breadth-first search finds shortest paths in graphs whose edges have unit length. Thus, if Dijkstra's algorithm is applied to a graph with edges of unit length, Dijkstra's algorithm reduces to breadth-first search. If we replace the heuristic function of best-first search with the identity function, best-first search reduces to breadth-first search.

6.21 Consider a spatial network representing road maps with road intersections as nodes and road segments between adjacent intersections as edges. Coordinates of the center points of road segments are provided. In addition, stores and customers tables provide the names, addresses, and locations of retail stores and customers. Explain how a shortest-path algorithm may be used for the following network-analysis problems:
1. For a given customer, find the nearest store using driving distance along the road network.
This might require multiple best-first single-pair computations. For each of these queries the v node is the customer's origin (e.g., work or home). The u node would be drawn from a list of stores that match the user's specification. The algorithm could find the distance between v and the first u in the list of u's; this would be set as the distance to beat. If a subsequent search finds a store that is closer along the network than the "distance to beat," then that store and distance become the new "distance to beat." In this way the search continues through the list of u's. Pruning can occur whenever the cumulative edge distance to a given u exceeds the "distance to beat."
2. For a given store S and a set of customers, find the cheapest path to visit all customers starting from S and returning to S.
This task might be approached using a local or a global distance optimization. Using a local (or greedy) approach, I might use multiple best-first single-pair computations: the first search would find the closest customer to S; then the nearest customer is set as the source and the distance is measured to the remaining customers. This would continue until all customers were visited. A global optimization might test every combination of routes that connect the customers and then select the shortest; this, however, would be very computationally expensive.
3. For each store, determine the set of customers it should serve, minimizing the total cost of serving all customers by using the nearest stores.
This is very similar to the first scenario; however, in this situation there are two lists of nodes. The source node is still the customer's location, but the search is repeated for each customer in the list. For instance, this might be abstracted as a nested for loop: the outer loop is "for each customer, find the nearest store," and the inner loop checks the network distance to each store. The closest store adds the customer to its set, and the loop continues through the customer list. In this way each customer is assigned to a store. This is also very computationally expensive; however, it would only have to be performed periodically to update the stores' sets.
4. Determine an appropriate location for a new store to serve the customers farthest from current stores.
I can imagine a brute-force approach, which finds a global measure of total (or mean) network distance from customers to existing stores, then places a new store node on the road network and recalculates the global measure, iteratively moving the new store along major roads until the global measure of customer distance traveled is minimized. This approach is somewhat like a k-means clustering algorithm, except that we are trying to minimize within-cluster distance and maximize among-cluster distance (stores would represent cluster centroids). It could be optimized by first computing the distance from each customer to the existing stores, then using this information to find clusters or hot spots of customers that are farthest from stores. These hot spots are the places where we might initially place the new store; then, as before, we move the new store node to try to minimize the global distance traveled.
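Returning to problem 1, a single search also works (a sketch, with an illustrative graph format; this uses Dijkstra's algorithm rather than the repeated best-first searches described above): because Dijkstra settles nodes in nondecreasing distance order, the first settled node that is a store is the nearest store.

    import heapq

    def nearest_store(graph, source, stores):
        # graph: {node: [(neighbor, edge_length), ...]}; stores: set of nodes.
        dist, heap, settled = {source: 0.0}, [(0.0, source)], set()
        while heap:
            d, u = heapq.heappop(heap)
            if u in settled:
                continue
            settled.add(u)
            if u in stores:
                return u, d                 # first store settled = nearest
            for v, w in graph.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        return None, float("inf")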
Chapter 7

7.1 (a)
Set       Support
{a,b}     3, or 3/8 = 37.5%
{c}       6, or 6/8 = 75%
{a,b,c}   2, or 2/8 = 25%

(b) Confidence for the association rule {a,b} → {c}: 2/3 = 66.7%
(c) Confidence for the association rule {c} → {a,b}: 2/6 = 33.3%
(d) Support for the association rule i1 → i2 is based on the count (or proportion) of all transactions that contain both i1 and i2. In contrast, confidence for the association rule i1 → i2 is the proportion of transactions containing i1 that also contain i2. Therefore the supports of i1 → i2 and i2 → i1 must be the same, but the confidences of the two rules will differ (as they do in (b) and (c) above) whenever the count of transactions containing i1 does not equal the count of transactions containing i2.
(e) Note: I have found it impossible to complete the exercise given only the information provided in the table. First, the data are mathematically invalid: according to the table, 45 of 100 lakes are near forest and 30 lakes are both near forest and inside a state park, so there must be 45 - 30 = 15 lakes that are near forest but not inside a state park. But the table also indicates that 90 of the 100 lakes are inside a state park, meaning that there can be only 10 lakes that are not inside a state park, so it is impossible that 15 lakes are near forest but not inside a state park. Second, even if the data were valid, it would not be possible to determine how many lakes are both inside a state park and adjacent to federal land (as is necessary to complete the problem). To complete the problem, I make two assumptions: 1) there are 75 lakes (not 90) inside a state park, and 2) there are 35 lakes that are both inside a state park and adjacent to federal land. These assumptions mean the lakes can be summarized by the following diagram:

[Venn diagram of near(X, forest), inside(X, state_park), and adjacent(X, federal_land). Region counts: forest only 5; forest and park only 20; park only 20; park and federal only 25; all three 10; forest and federal only 10; federal only 5; outside all three 5.]

Therefore, the support and confidence for the association rules given in the table are:
lake(X) → near(X, forest): support 45 (45/100 = 45%), confidence 45/100 = 45%
lake(X) and inside(X, state_park) → near(X, forest): support 30, confidence 30/75 = 40%
lake(X) and inside(X, state_park) → adjacent(X, federal_land): support 35, confidence 35/75 = 46.7%
lake(X) and inside(X, state_park) and adjacent(X, federal_land) → near(X, forest): support 10, confidence 10/35 = 28.6%
None of these rules has both a support of more than 30 and a confidence of more than 70 percent.

(f) For a particular point p that is in neither Cmo nor Cmn, the distance between p and the nearest medoid mi, d(p, mi), will be the same for both Mt and Mt+1, because the nearest m will be the same in both cases (the only m's that change between Mt and Mt+1 are mo and mn). Because there is no change in d(p, mi) for all such points p, only points that are in Cmo or Cmn must be fetched into main memory, and only the nonmedoid points need to be fetched, because the locations of the medoids mo and mn (which are needed to compute d(p, mi) for all the nonmedoid points) should already be available from prior processing in the algorithm.
(g) The time required to fetch only the nonmedoid points of Cmo and Cmn, relative to the time required to fetch all points, would be no more than 2/k (assuming that all clusters have the same size), because fetching all points would require fetching the points of k clusters instead of just 2. However, Cmo and Cmn may contain many of the same points if mo and mn are near each other, in which case the optimal relative processing time could be as low as 1/k, if the points in Cmo are almost the same set as the points in Cmn.
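A minimal sketch of the support/confidence arithmetic in 7.1(a)-(c). The transaction set is one I constructed to reproduce the counts above; it is not given in the textbook.

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"c"}, {"a", "b", "c"},
                    {"c"}, {"b", "c"}, {"c"}, {"a"}]

    def support(itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        # Of the transactions containing lhs, the fraction also containing rhs.
        return support(lhs | rhs) / support(lhs)

    # support({"a", "b"}) == 3/8, support({"c"}) == 6/8,
    # confidence({"a", "b"}, {"c"}) == 2/3, confidence({"c"}, {"a", "b"}) == 1/3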
7.4 a) Association rules are a kind of statistical correlation method. The major difference is that ARs require a conditional-probability test (X→Y) between like objects (items in a store), whereas statistical correlation is more interested in causal rules between different groups (big car = lousy miles-per-gallon).
b) It may be easier to see the distinction with respect to time. Autocorrelation implies that "things" are always related under a given circumstance, as a universal law in any data set; in a sense, it is an observation of itself that occurs periodically. Cross-correlation is the relation between two different sets of data within the same period interval.
c) Here, classification is only part of a complete solution for spatial data mining. In location prediction we need an accurate classification model, but we also have to consider the relation of locality and spatial context in the formulation.
d) When defining a hot spot, it is commonly assumed that the points (or data items) are similar to a very high degree, which indicates their usefulness. A cluster is less extreme in this regard: while it may have a number of related elements, the points can still be diverse within the cluster.
e) The goal of classification is to assign items (or points) to certain pre-defined classes. Clustering does not have the notion of a pre-defined class and, in fact, is used to help develop a classification scheme.
f) Associations demonstrate correlation between some item (or set of items) and another. By finding associations, we can define a better classification structure; here, there is a measure of certainty that these items will always be correlated. In clustering, the goal is to minimize the difference between objects within a cluster and maximize the difference between objects in different clusters. However, items may still be (slightly) diverse within a cluster, and there isn't a measure of certainty that these items will always be clustered together.

7.7 What is special about spatial statistics relative to statistics?
One difference between spatial statistics and (regular, non-spatial) statistics is that spatial statistics has more than one dimension to it. Another difference is that spatial data does not satisfy the independence assumption: spatial data is usually highly self-correlated. This property, that like things cluster together in space, is considered the first law of geography ("Everything is related to everything else, but nearby things are more related than distant things").

7.8 Which of the following spatial features show positive spatial autocorrelation? Why? (Is there a physical/scientific reason?)
Spatial autocorrelation refers to the relationship among values of a variable attributable to the way the corresponding area units are ordered in space. Positive spatial autocorrelation means that adjacent area units have similar values or characteristics; negative spatial autocorrelation means that nearby area units have dissimilar values or characteristics; and no spatial autocorrelation means that there is no particular systematic structure in how the pattern is formed. Water content, temperature, soil type, and annual precipitation (rain, snow) have positive spatial autocorrelation: nearby areas tend to have similar water content, temperature, soil type, and annual precipitation.
7.9 Classify the following spatial point functions into the classes positive spatial autocorrelation, no spatial autocorrelation, and negative spatial autocorrelation.
a) positive spatial autocorrelation
b) negative spatial autocorrelation
c) positive spatial autocorrelation
d) no spatial autocorrelation

7.10 "The only data mining technique one needs is linear regression, if features are selected carefully."
Linear regression can be used for classifying linearly separable classes; if the features are selected carefully enough to make the model linear, linear regression is enough.

7.11 Compute Moran's I for the gray-scale image shown in Figure 7.14(a) and the matrices in Figures 7.6(b) and 7.6(c). I used Matlab to calculate the I values, with a = 0.11 as the normalized value in the weighted neighborhood matrix and I = (z*w*z')/(z*z').

Figure 7.6(b):

    z = [0 1 -4 -2 6 4 -5 -3 3]
    a = 0.11
    w = [0 a 0 a 0 0 0 0 0;
         a 0 a 0 a 0 0 0 0;
         0 a 0 0 0 a 0 0 0;
         a 0 0 0 a 0 a 0 0;
         0 a 0 a 0 a 0 a 0;
         0 0 a 0 a 0 0 0 a;
         0 0 0 a 0 0 0 a 0;
         0 0 0 0 a 0 a 0 a;
         0 0 0 0 0 a 0 a 0]
    I = (z*w*z')/(z*z')     % I = 0.0152

Figure 7.6(c), with the neighborhood weight matrix defined analogously:

    z = [-2 4 6 -4 0 3 -5 -3 1]
    I = (z*w*z')/(z*z')     % I = 0.1555

Figure 7.14(a), a 4x4 grid, with a = 0.0625 and w the 16x16 neighborhood matrix of the grid:

    z = [-6.125 -4.125 -1.125 -0.125 -2.125 -8.125 3.875 3.875 ...
         3.875 -3.125 -11.125 4.875 -1.125 1.875 11.875 6.875]
    I = (z*w*z')/(z*z')     % I = 0.0100
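The same computation can be cross-checked outside Matlab; a sketch in Python/NumPy for the Figure 7.6(b) case, using the same formula I = (z W z') / (z z') with mean-centered z and normalized weights:

    import numpy as np

    z = np.array([0, 1, -4, -2, 6, 4, -5, -3, 3], dtype=float)
    a = 1 / 9                      # the normalized weight (~0.11) used above
    W = a * np.array([
        [0,1,0,1,0,0,0,0,0], [1,0,1,0,1,0,0,0,0], [0,1,0,0,0,1,0,0,0],
        [1,0,0,0,1,0,1,0,0], [0,1,0,1,0,1,0,1,0], [0,0,1,0,1,0,0,0,1],
        [0,0,0,1,0,0,0,1,0], [0,0,0,0,1,0,1,0,1], [0,0,0,0,0,1,0,1,0]])

    I = (z @ W @ z) / (z @ z)      # ~0.015, matching the Matlab result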
7.12 Compare and contrast the concepts in the following pairs:
i. Spatial outliers vs. global outliers
A global outlier is an inconsistency relative to the whole of the data, whereas spatial outliers concern a localized phenomenon. For example, consistently inconsistent readings from one sensor compared with all other sensors is a case of a global outlier (and probably indicates sensor failure). A spatial outlier is a strange "pocket" of inconsistent readings that is not observed at any other location.
ii. SAR vs. MRF Bayesian classifiers
The primary difference between the two is that SAR (spatial autoregressive regression) treats object classes as independent of each other, while in MRF the class labels of neighbors are considered related to the definition of the object. For both, predictions are localized to a strict neighborhood.
iii. Co-locations vs. spatial association rules
Spatial association rules need not consider locality to other objects when defining a relationship; in fact, they do not consider individual items at all and instead try to categorize behavior across the whole of the data set. Co-location rules attempt to find common pairs (or groups) of items that repeatedly appear together within some defined distance. Nothing needs to be known about the data set other than that items are distinct and identifiable, whereas in spatial association some knowledge of what exists in the data is needed to create effective rules.
iv. Classification accuracy vs. spatial accuracy
Both are intended to determine the effectiveness of classification. The first tests what percentage of all objects was correctly classified by the algorithm. However, when space is discretized for classification, we also need to know how accurate the predictions are with respect to their actual locations; hence the need for a test of spatial accuracy.

7.13 Spatial data can be mined either via custom methods (e.g., co-locations, spatial outlier detection, SAR) or via classical methods (e.g., association rules, global outliers, regression) after selecting relevant spatial features. Compare and contrast these approaches. Where would you use each approach?
A global outlier is an object whose values are significantly different from those of the other objects in the entire data set, whereas a spatial outlier is a spatially referenced object whose values are significantly different from those of the other spatially referenced objects in its spatial neighborhood (a local instability relative to its neighbors). Spatial association rules are defined in terms of spatial predicates rather than items and were designed for categorical attributes. Co-location rules attempt to generalize association rules to point-collection data sets that are indexed by space. SAR is spatial autoregressive regression; the SAR model is similar to the linear logistic model in the transformed feature space. In other words, the SAR model assumes the linear separability of classes in the transformed feature space.

7.14 (Group 8 did not answer this.)

7.15 Identify the two clusters found by the k-medoid algorithm for the following sets of points: (i) (0,0) (1,1) (1,0) (0,1) (6,6) (7,6) (6,7) (7,7); (ii) (0,0) (1,1) (1,0) (0,1).
(i) Clearly the medoids would be (1,1) and (6,6).
(ii) Most likely the medoids would be (0,0) and (1,1).

7.16 (Group 9 did not answer this.)

7.17 Define scale-dependent and scale-independent spatial patterns. Provide examples.
Scale-dependent spatial phenomena are those whose spatial patterns vary at different scales from the observer's point of view.
For example, at the local scale (say, neighborhood level) the streets may appear to form a regular grid, while at the city level the street network might no longer appear as a regular grid. Most spatial patterns are scale-dependent; indeed, in the geography community it is considered very important to situate one's research at a specific scale and to realize that most geographic phenomena are scale-dependent. On the contrary, scale-independent spatial phenomena do not show different spatial patterns at different geographic scales. For example, no matter the scale, fish live in the water, not on the land, so you cannot go fishing in a land area that has no water.

7.21 Consider the problem of detecting multiple spatial outliers. Discuss how one should extend the techniques discussed in Section 7.6. Note that a true spatial outlier can cause its neighbors to be flagged as spatial outliers by some of the tests discussed in Section 7.6.
Given the techniques described in Section 7.6, the detection of multiple spatial outliers can become imprecise because of the outliers' influence on neighboring values. Thus, in order to effectively check for multiple spatial outliers in a dataset, it is imperative to account for this neighborhood problem. One more computationally intense method is to detect one outlier per calculation of the scatterplot, Moran scatterplot, or S(x): once a value is detected and isolated, one removes it from the dataset, runs the calculations again, and determines the next outlying data point. This continues until no data point falls beyond the threshold for being an outlier. Another method is an extension of this calculation that reduces the number of iterations but may introduce more opportunity for error: from the first calculation, find all of the spatial outliers that are not neighbors of another spatial outlier; if there are multiple outliers next to each other, have the algorithm select the point that is most different within that population. Isolate these values and perform new calculations without them, in the same manner as above. If there are still outliers, follow the same isolation process and create new plots until no more outliers are found. Again, this might introduce a bit more error into the calculations, but that could only be determined by testing the system.

Chapter 8

8.1

8/3   11/5  10/5  5/3
11/5  15/8  17/8  11/5
13/5  15/8  23/8  14/5
7/3   14/5  14/5  11/3

8.2 In order to compute Rhp = R - Rlp, we must first compute Rlp as in problem 8.1:

R:
1  2  3  3
2  4  1  3
2  2  2  5
1  1  4  4

Rlp:
2.25  2.16  2.67  2.5
2.16  2.11  2.77  2.83
2     2.11  2.88  3.16
1.5   2     3     3.75

Now we can compute Rhp = R - Rlp:

-1.25  -0.16   0.33   0.5
-0.16   1.89  -1.77   0.17
 0     -0.11  -0.88   1.84
-0.5   -1      1      0.25
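A sketch of the two map-algebra operations used in 8.1 and 8.2 (shown for the grid R of 8.2; border neighborhoods are simply clipped):

    import numpy as np

    R = np.array([[1, 2, 3, 3],
                  [2, 4, 1, 3],
                  [2, 2, 2, 5],
                  [1, 1, 4, 4]], dtype=float)

    Rlp = np.empty_like(R)
    rows, cols = R.shape
    for i in range(rows):
        for j in range(cols):
            # Focal operation: mean of the (clipped) 3x3 neighborhood.
            block = R[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            Rlp[i, j] = block.mean()

    Rhp = R - Rlp                  # local operation: cell-by-cell difference
    # Rlp[0, 0] == 2.25 and Rhp[0, 0] == -1.25, as in the grids above.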
8.3 The raster maps represent the nitrogen content and the soil type at each pixel. Calculate the average nitrogen content of each soil type, and classify the operation as local, focal, zonal, or global.

[Raster maps: a nitrogen-content grid and a soil-type grid with types A, B, and C; the cell layout is not recoverable from the extraction.]

The operation is zonal, since the value of a cell in the new raster is a function of the value of that cell in the original layer and the values of the other cells that appear in the same zone specified by the soil-map raster. The average nitrogen content for each soil type is as follows:
For A: (1 + 2 + 5 + 4) / 4 = 3
For B: (3 + 3 + 4 + 1 + 3) / 5 = 2.8
For C: (2 + 2 + 1 + 1 + 4) / 5 = 2
The raster map for this operation assigns each cell the average nitrogen content of its soil type: 3 for A cells, 2.8 for B cells, and 2 for C cells.

8.4 The raster layer R shows the locations of households (H) and banks (B). Assume each household does business with both banks. A survey of the last thirty visits by members of each household showed that a household interacted with the nearest bank 60 percent of the time. What was the average distance that members of the household traveled to banks in the last thirty days? Assume a distance of one to horizontal and vertical neighbors and 1.4 to diagonal neighbors.
For this question, I assumed that each household visited a bank 30 times (i.e., once a day for 30 days). Further, as stated in the question, households preferentially visited the closer bank (i.e., 60 percent of visits, or 18 of the 30, go to the closest bank), and diagonal distance is 1.4 while edge distance is 1. To answer the question I made a distance grid for each bank and then used these to calculate the averaged distance grid [grids not reproduced]. Example calculation for the top-left cell of the averaged-distance grid: (18 * 2 + 12 * 3.4) / 30 = 2.56.

8.7 (p. 248) The vector-versus-raster dichotomy can be compared with the "what" and "where" in natural language. Discuss.
A raster is defined by a grid of pixels, each pixel a different color, which together make an entire image. To display raster graphics, we need to define "what" color each pixel should have. Once a raster image has been acquired, a geo-reference is applied: the process of relating the grid positions of the pixels to their corresponding latitudes and longitudes. In this way a computer can relate pixel position to latitude and longitude; however, the system has no knowledge of the details (such as a coastline) in the raster images it displays. Vector graphics, on the other hand, are not defined by pixels and are not constrained to a grid format. Vector graphics are instructions to the computer about how the objects should be shaped and their relative sizes. To display vector graphics, the computer needs to know "where" each object is to be displayed.

8.8 Is topology implicit or explicit in a raster DBMS?
Topology is explicit in a raster DBMS.

8.9 Compute the storage space needed to store a raster representation of a 20 km x 20 km rectangular area at a resolution of:
(a) 10 m x 10 m pixels: (20,000 / 10)^2 = 4 x 10^6 pixels
(b) 1 m x 1 m pixels: 4 x 10^8 pixels
(c) 1 cm x 1 cm pixels: 4 x 10^12 pixels

8.10 Study the raster data model proposed by OGIS and the one employed by Idrisi Kilimanjaro, a raster-based GIS software package developed by Clark Labs. Compare them with the map algebra model discussed in this chapter.
As with most other GIS software packages, Idrisi Kilimanjaro utilizes the map algebra model for most of its raster-based analyses. This has historically proven to be the easiest and most straightforward method of overlay among raster data sets. Idrisi is among the more powerful of the raster-based GIS software systems. It can perform simple overlays to combine multiple datasets into new data using simple mathematics, such as addition and normalized differencing. It can perform logarithmic and power transformations and can differentiate between the four different operations (local, focal, zonal, and global) fairly well.
8.10 Study the raster data model proposed by OGIS employed by Idrisi Kilimanjaro, a raster-based GIS software package developed by Clark Labs. Compare it with the map algebra model discussed in this chapter. As with most other GIS software packages, Idrisi Kilimanjaro uses the map algebra model for most of its raster-based analyses; historically this has proven to be the easiest and most straightforward method of overlay among raster data sets. Idrisi is among the more powerful raster-based GIS systems. It can perform simple overlays that combine multiple datasets into new data using simple mathematics, such as addition and normalized differencing; it can perform logarithmic and power transformations; and it differentiates between the four operation classes (local, focal, zonal, and global) fairly well. In performing neighborhood analyses, such as slope and aspect determination and even the more difficult interpolation techniques (inverse distance weighting, kriging, etc.), Idrisi can determine the extent of its neighborhood and determine to which of the four operation classes an analysis belongs. One reason that OGIS may not yet have made a public recommendation for the raster data model is that the map algebra model is already so dominant within the GIS software community. OGIS may still be trying to find another approach to combining and overlaying raster data, but to this point nothing appears to be as effective as the map algebra approach. It will be interesting to see what OGIS finally decides to recommend for the raster data model and how much it differs, if at all, from the already widely used map algebra model. 8.11 Define the "closure" property of algebra. Evaluate the OGIS model for vector spatial data for closure. The closure property of an algebra: if an operation is applied to two elements of a domain, the result must lie in the same domain. The geometry-collection type gives the OGIS spatial data types closure under geometric operations such as geometric-difference and geometric-intersection (textbook p. 26). 8.12 What is special about data warehousing for geospatial applications (e.g., census)? The data contain both spatial and nonspatial attributes, and results need to be presented as maps, in contrast to the tables or text used by traditional data warehousing methods; visualization of data with spatial representations is very important for geospatial applications. Geospatial applications also need spatial aggregates, which are difficult to implement. 8.13 Compare and contrast: a.) Measures vs. Dimensions: Dimensions are independent variables that uniquely identify measures (dependent variables). A quick example: in a school database, your social security number or student id (dimension) can be used to identify which college of study you belong to (measure), and not the other way around. b.) Data Mining vs. Data Warehousing: Data mining can be seen as the application of methods and algorithms to find patterns in the data, which in turn allows better prediction of future behavior. In data warehousing, the process is more deductive, testing hypothesized patterns against historical data. In a sense, data mining is concerned with modeling the data, whereas data warehousing provides the actual data and helps to inform the mining algorithms of what would be a reasonable pattern to test for. c.) Rollup vs. Drilldown: A rollup on a given dimension or set of dimensions aggregates the data into a smaller set of tuples; these tuples can be seen as an abstraction of the original data, reducing the level of detail considered. A drilldown operation does the exact opposite of a rollup. d.) SQL2 Group By vs. Group By with CUBE: With the GROUP BY clause, you specify which dimensions you want to examine (along with defining them in the SELECT clause); this amounts to a rollup on the dimensions not specified. The CUBE operator is a generalization of this procedure and provides a tabular, combinatorial view over the specified N dimensions. Thus the regular GROUP BY aggregation is done once, but the CUBE operation shows all possible aggregations beyond the first one specified. e.) CUBE vs. Rollup: As alluded to above, the rollup operation provides a single sequence of aggregations (one per prefix of the listed dimensions), while CUBE is the generalized version, showing every combination of aggregations beyond the specified one; the sketch below makes the difference concrete.
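A minimal Python sketch of the ROLLUP/CUBE contrast over a toy table (the rows and column names are illustrative assumptions, not data from the exercise):

    from itertools import combinations

    # Toy fact table: (region, product, amount).
    rows = [("N", "beer", 10), ("N", "bread", 3), ("S", "beer", 8)]
    dims = ("region", "product")

    def aggregate(group_by):
        # Sum `amount` grouped by the chosen dimension positions.
        out = {}
        for r in rows:
            key = tuple(r[i] for i in group_by)
            out[key] = out.get(key, 0) + r[2]
        return out

    # ROLLUP: one aggregation per prefix: (), (region,), (region, product).
    for k in range(len(dims) + 1):
        print("ROLLUP", dims[:k], aggregate(range(k)))

    # CUBE: one aggregation per subset; also includes (product,) alone.
    for k in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), k):
            print("CUBE", tuple(dims[i] for i in subset), aggregate(subset))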
8.14 Classify the following aggregate spatial functions into distributive, algebraic, and holistic functions: mean, medoid, autocorrelation (Moran's I), minimum bounding orthogonal rectangle, minimum bounding circle, minimum bounding convex polygon, and the spatial outlier detection tests in Section 7.4. Distributive: minimum bounding orthogonal rectangle, minimum bounding circle, minimum bounding convex polygon. Algebraic: mean, Moran's I. Holistic: medoid, spatial outlier detection tests. 8.15 (Group 8 did not answer this.) 8.16 Some researchers in content-based retrieval consider ARGs to be similar to Entity Relationship models. Do you agree? Explain with a simple example. No; ER modeling does not interact with queries in the same manner as an ARG. If I query a database of bank transactions modeled by an ER diagram, there is no parallel to the feature-space points which give ARGs their value. 8.17 Can ARGs model raster images with amorphous phenomena? ARGs are completely connected graphs that represent objects and their relationships, which makes them naturally suited to modeling discrete phenomena. Using this model to represent amorphous phenomena could be inaccurate: an ARG requires the specification of (discrete) objects within images, with relationships specified between those objects, and for amorphous phenomena this encapsulation can lead to errors. If we are interested only in the values at certain points, the model would work, but capturing the continuous nature of such phenomena might not be fully possible with ARGs. 8.18 Any arbitrary pair of spatial objects will always have some spatial relationship, e.g., distance. This can lead to clique-like ARGs with a large number of edges to model all spatial relationships. Suggest ways to sparsify ARGs for efficient content-based retrieval. We can think of several ways to sparsify ARGs for efficient content-based retrieval. 1) Method 1: Spatial objects usually have stronger connections with nearby objects and weaker connections with objects far away, so we can use a distance threshold as a filter: for objects at a distance greater than the threshold, we do not model the relationship in the ARG. 2) Method 2: Instead of a distance filter, we can use a specific number n as the filter, modeling in the ARG only the relationships between each spatial object and its n nearest neighbors. 3) Method 3: First use a clustering algorithm to divide all spatial objects into m clusters, then use a traditional ARG to model the spatial relationships among objects within each cluster. For objects belonging to different clusters, we do not model their spatial relationships directly; instead, we use the centroid of each cluster as that cluster's representative and use an ARG to model the spatial relationships of the m representatives (centroids). 8.21 (p. 249) Explore spreadsheet software, for example MS Excel, to find the pivot operation. Create an example data set and illustrate the effect of the pivot operation.
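The spreadsheet pivots below can also be reproduced programmatically; a minimal pandas sketch over the same transaction table (pandas and the column names are this sketch's choices, not part of the original spreadsheet exercise):

    import pandas as pd

    # The transaction table used in the Excel example below.
    df = pd.DataFrame(
        [("Kate", 1, "Beer", 7), ("Mike", 1, "Beer", 9), ("Mary", 1, "Diaper", 5),
         ("Jenny", 5, "Formula", 10), ("Mary", 7, "Diaper", 5), ("Mary", 7, "Formula", 5),
         ("Mary", 9, "Beer", 10), ("Jenny", 12, "Bread", 3), ("Kate", 15, "Beer", 8),
         ("Mike", 20, "Beer", 10), ("Mike", 20, "Bread", 1)],
        columns=["Name", "Date", "Purchase", "Amount"])

    # Each pivot regroups the same facts along a different dimension.
    print(df.pivot_table(values="Amount", index="Name", aggfunc="sum"))      # by person
    print(df.pivot_table(values="Amount", index="Purchase", aggfunc="sum"))  # by item
    print(df.pivot_table(values="Amount", index="Date", aggfunc="sum"))      # by date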
Example data set, one row per transaction (Name, Date, Purchase, Amount):
Kate 1 Beer 7
Mike 1 Beer 9
Mary 1 Diaper 5
Jenny 5 Formula 10
Mary 7 Diaper 5
Mary 7 Formula 5
Mary 9 Beer 10
Jenny 12 Bread 3
Kate 15 Beer 8
Mike 20 Beer 10
Mike 20 Bread 1
Sum of amount purchased by each person: Jenny 13, Kate 15, Mary 25, Mike 20; Grand Total 73.
Sum of purchases by item: Beer 44 (Kate 15, Mary 10, Mike 19), Bread 4 (Jenny 3, Mike 1), Diaper 10 (Mary 10), Formula 15 (Jenny 10, Mary 5); Grand Total 73.
Sum of purchases by date: date 1: 21, date 5: 10, date 7: 10, date 9: 10, date 12: 3, date 15: 8, date 20: 11; Grand Total 73.
The pivot operation regroups the same facts along whichever dimension (person, item, or date) is chosen, without changing the grand total.
8.22 Study the census data set and model it as a data warehouse. List the dimensions, measures, aggregation operations, and hierarchies. Using the NHGIS dataset as an example; the examples listed here are limited to demographic information. Since demographic data does not make much sense when grouped across years (for example, it does not make sense to sum the populations of different years), we need to create data cubes for each census year separately; the following example refers to the data cubes of one specific census year. Dimensions are as follows:
1) Location (which has a hierarchical structure: Region -> State -> County -> Tract -> BlockGroup -> Block)
2) Sex
3) AgeGroup
4) Race
5) Origin
6) MarriageStatus
Measure: Population
Since the location dimension has a hierarchical structure, the whole dataset must be structured as a lattice of data cubes, where each cube is defined by the combination of a level of detail for each dimension. Data abstraction in this model means choosing a meaningful summary of the data. Choosing a data abstraction corresponds to choosing a particular projection in this lattice of data cubes: (a) which dimensions we currently consider relevant, and (b) the appropriate level of detail for each relevant dimensional hierarchy. Specifying the level of detail identifies the cube in the lattice, while the relevant dimensions identify which projection (from n dimensions down to the number of relevant dimensions) of that cube is needed; a small sketch of rolling up the location hierarchy follows.
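A minimal sketch of rolling population up the location hierarchy (the county/tract/block rows here are made-up illustrations, not NHGIS data):

    # Roll population up the location hierarchy: Block -> Tract -> County.
    rows = [("Hennepin", "T1", "B1", 120), ("Hennepin", "T1", "B2", 80),
            ("Hennepin", "T2", "B3", 200), ("Ramsey", "T3", "B4", 150)]

    def rollup(rows, depth):
        # Aggregate population by the first `depth` levels of the hierarchy.
        out = {}
        for r in rows:
            key = r[:depth]
            out[key] = out.get(key, 0) + r[3]
        return out

    print(rollup(rows, 3))  # block level (most detailed)
    print(rollup(rows, 2))  # tract level
    print(rollup(rows, 1))  # county level (least detailed)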
[Figure 1: The lattice of data cubes defined by the location hierarchy, from the least detailed cube, Region_Sex_AgeGroup_Race_Origin_MarriageStatus, through State, County, Tract, and BlockGroup, down to the most detailed, Block_Sex_AgeGroup_Race_Origin_MarriageStatus.]
[Figure 2: The 0-D, 1-D, 2-D, 3-D, 4-D, 5-D, and 6-D data cubes over the six dimensions (legend: 1 = Location, 2 = Sex, 3 = AgeGroup, 4 = Race, 5 = Origin, 6 = MarriageStatus; aggregate: Sum). For example, 1_2_3_4_5 represents the cube query: SELECT Location, Sex, AgeGroup, Race, Origin, Sum(Population) AS Population FROM POPULATION GROUP BY CUBE Location, Sex, AgeGroup, Race, Origin.]
In this example, we can use aggregation operations such as Min and Max (e.g., the minimum population of a state), Sum (e.g., summing the population of all states), and Average (the average population of all states). Basic cube operators such as roll-up, drill-down, slice-and-dice, and pivoting can also be used over the above aggregation hierarchy.
NEW QUESTIONS
Design and answer 3 new questions on selected sections in the book.
Question 1 (Chapter 1): Design a query that could be either spatial or nonspatial, and explain. "Find the top salesperson in each region." The answer depends on how the word region is defined. The query can be nonspatial if the regions are static attribute values; if the regions are defined geometrically, the query becomes spatial. The boundaries of the Minnesota regions of Minneapolis, St. Cloud, Rochester, and Duluth may change often.
Question 2 (Chapter 8.1.1): For the following neighborhood grid, find the FocalSum (with the center cell included in each neighborhood) using (a) the Rook scheme, (b) the Bishop scheme, and (c) the Queen scheme:
6 7 -3 9
3 4 -6 2
5 0 3 4
0 1 -4 -6
a) Rook:
16 14 7 8
18 8 0 9
8 13 -3 3
6 -3 -6 -6
b) Bishop:
10 4 3 3
10 15 14 2
10 -7 4 -6
0 9 0 -3
c) Queen:
20 11 13 2
25 19 20 9
13 6 -2 -7
6 5 -2 -3
Question 3 (Chapter 2.1.4): Explain why the following nine-intersection matrix cannot occur in two dimensions:
1 1 0
1 1 0
0 0 0
This matrix says that the interiors and boundaries of A and B intersect each other while the exteriors meet nothing. If the interiors and boundaries intersect, there are two possibilities: the objects overlap each other at some points, or they overlap at every point (are totally identical). But in two dimensions the exteriors of any two bounded objects always intersect (both contain all sufficiently distant points), so the exterior row and column can never be all zero, and the matrix above is impossible. The nine-intersection matrix for overlap contains a '1' in all nine positions.
Mid-term Exam Solutions (Csci 8715, Spring 2004)
1. Currently, a literature survey via google.com, though useful, is not adequate: it does not cover a large fraction of the web, cannot search many databases, and has access neither to older documents nor to tacit (i.e., not yet codified) knowledge. In addition, it is non-trivial for users to identify the set of all keywords relevant to the topic of a search. Thus researchers should complement the use of google.com with some of the following: (a) search relevant databases, e.g., DBLP, CiteSeer, ACM DL, IEEE DL, amazon.com, etc.; (b) visit libraries to search for older documents; (c) contact domain experts for help in identifying relevant papers, keywords, people, and tacit knowledge.
2. Use the pictogram grammar in Chapter 2 to design pictograms for (a) linestring or polygon, (b) point, (c) a new entity road-network and a new relationship part-of, with their pictograms, and (d) point.
3. Parts: (a) A river originating in a country has two cases. The river may end within the same country without ever leaving its boundaries; this case is modeled using inside or within. Alternatively, the river may flow outside the country of origin, which is modeled by cross. Note that overlap is not appropriate unless the river is a polygon. (b) The topological relationship between Vatican and Italy seemed to confuse many students. Note that the interior of Italy has a hole to account for the interior of Vatican; thus the interiors of Italy and Vatican do not intersect, and the relationship is modeled by touch, given the OGIS topological predicates. Note that this example shows a limitation of the OGIS model, which does not model the special spatial relationships among polygons with holes. (c) Partitions require a collection of OGIS relationships: forest-stands are pairwise disjoint, each forest-stand is covered by its forest, and the union of all forest-stands should cover (equal) the forest.
4. Most of the students did fine with this question. A few made the mistake of duplicating nodes/leaves across the siblings in an R-tree. Recall that the MOBRs of siblings may overlap in an R-tree, but data items are not duplicated; it is the R+-tree that allows replication of data items to eliminate the overlap between the MOBRs of siblings. I will use nested parentheses to denote the trees: (a) R-tree: (root < 1 (A (a b c)) (B (d e f)) > < 2 (C (g h)) (D (i j)) (E (k l)) >) (b) Search path: root – 1 – A – (c) – B – d – e – 2 – C – g – h – D – i
5. Most of the students did fine, but a few tried to argue that HEPV is always better than FPV. Complete domination of one method by another is quite rare in the systems research area; usually each method dominates in some region of the design space, and that is the case here. HEPV reduces the cost of updates and the storage cost, whereas FPV is faster at computing shortest paths. A toy sketch of the FPV side of this trade-off follows.
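A toy sketch of the fully precomputed path view (FPV): an all-pairs next-hop table built with Floyd-Warshall, after which every shortest-path query is a simple table walk. The graph is a made-up illustration, and the HEPV contrast in the comments is a summary rather than an implementation:

    # FPV: precompute an all-pairs next-hop table (O(n^2) storage, costly to
    # update when edge weights change), then read any shortest path off it.
    # HEPV would instead partition the graph into fragments and precompute
    # paths per fragment plus a small boundary-node graph, trading slower
    # query-time path assembly for cheaper storage and updates.
    INF = float("inf")

    def build_fpv(n, edges):
        dist = [[INF] * n for _ in range(n)]
        nxt = [[None] * n for _ in range(n)]
        for i in range(n):
            dist[i][i] = 0
        for u, v, w in edges:  # undirected edges (u, v, weight)
            dist[u][v] = dist[v][u] = w
            nxt[u][v], nxt[v][u] = v, u
        for k in range(n):  # Floyd-Warshall relaxation
            for i in range(n):
                for j in range(n):
                    if dist[i][k] + dist[k][j] < dist[i][j]:
                        dist[i][j] = dist[i][k] + dist[k][j]
                        nxt[i][j] = nxt[i][k]
        return dist, nxt

    def path(nxt, u, v):
        # Walk the next-hop table from u to v.
        p = [u]
        while u != v:
            u = nxt[u][v]
            p.append(u)
        return p

    dist, nxt = build_fpv(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (0, 3, 5)])
    print(path(nxt, 0, 3), dist[0][3])  # -> [0, 1, 2, 3] 3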