Chapter 2 Data Structures

VECTOR DATA STRUCTURES

There are numerous spatial data structures, or data models, used in GIS software. Figure 1.1 diagrams the relationships between common data models. We are primarily concerned with two data structures: coverages and shapefiles. There are two aspects of these structures: the folder and file structure, that is, how the files that contain the data are organized on disk, and the actual data model, that is, the organization of both the spatial and the non-spatial (attribute) data.

Folder and File structure

There are two primary spatial data file structures we need to discuss: the disk structures of the two file types we are going to be using in this course, coverages and shapefiles. It is very important that you understand the differences in how these data files are stored on the disk because incorrect management is a major problem for beginning users of these files.

Although we are going to be using only coverages and shapefiles and several kinds of attribute data files, there are many other forms of GIS data, as outlined in Figure 1.1.

Figure 1.1. A schematic of data models. After Chang, 2002.

The vector data model, as described in Chapter 1, is based on points and their X,Y coordinate pairs. These points are used to construct the other feature types (lines and polygons). Under Vector models, the figure shows topological and non-topological models. Shapefiles are considered to be non-topological. It might be better to say that shapefiles are designed so that topology can be constructed on the fly and is not built into the data. As a result, shapefiles load and draw faster than coverages.
Coverages are the topological spatial data structures that we will be using. TINs are triangulated irregular networks used to represent continuous data like elevation and pollution concentrations. TINs are made up of triangles whose edges connect three points at which, say, elevation is known. The slope and aspect of the triangular polygons can be calculated easily, and other raster-type operations can be carried out with this data structure. Regions are an advanced polygon structure in which the polygons may overlap and the logical polygon may be made up of several unconnected graphic polygons. Dynamic segmentation is a model based on the line or arc feature type and allows for the assignment of different attributes to different parts of an arc.

The GeoDataBase is a complex structure that uses point, line, and polygon geometries to represent the graphic part of GI. Point features can consist of a single point or a set of points. A line feature is made up of a set of arcs that are not necessarily connected. A polygon feature is made up of one or many rings that may or may not be connected. While with the other structures the attribute data are stored separately from the spatial data, that is not the case with the GeoDataBase: both spatial and attribute data are stored in the same database.

The ways in which coverage and shapefile data are stored are quite different and can cause considerable confusion for novice users. We will take up the file structure of each of these data model types separately.

Coverages

A coverage is a topological data structure (see later for a definition of topological). On disk, a coverage is defined by the data stored in a folder. Figure 2.2 shows the coverages for a database called Yellowstone. Note that the Yellowstone database includes a folder called info as well as coverages for various versions of land use.
The info folder is important since that is where the data is actually stored, but the really important point from a data management point of view is that any folder of GIS data that contains an info folder is a workspace. Any folders inside a workspace, that is, inside any folder containing an info folder, cannot be moved by dragging or by copying and pasting! The files inside the coverages actually contain only pointers to data in the info folder, so moving them in Explorer or in DOS will destroy the coverage! You must use the data management tools in the ESRI software.

Figure 2.2. Windows Explorer view of the database called Yellowstone.

In Figure 2.2, the coverage is called INLANBUF and it is located in a workspace called LANDED, although you can’t tell that from the figure. You can tell it is a workspace from the fact that there is an INFO folder. The contents of some of the files in the coverage INLANBUF are:

pat.adf – Polygon or point attribute table. Contains information about the polygons (or points if it is a point coverage).
aat.adf – Not in Figure 2.2, but if present contains the arc attribute table.
tic.adf – Contains the tic coordinates for the coverage. Tics locate the coverage in space.
bnd.adf – Contains the coordinates of the current bounding rectangle for the coverage. The bounding rectangle is also called the extent of the data.
prj.adf – Contains projection, datum, and units information about the coverage.
arc.adf – Contains information about the arcs in the coverage.

If the coverage were a PC ArcInfo coverage, then the files with names like pat.dbf and tic.dbf would be, respectively, the actual polygon and tic database files and NOT pointers to data in an info folder. Since the files in workstation ArcInfo have an .adf extension, you know they contain only pointers to the data stored in the info folder.
Coverages produced and used by PC ArcInfo do NOT have the info folder structure and thus may be moved or copied and pasted in Windows Explorer or DOS.

Make sure you understand the points made above about the folder and file structure of coverages. If you don’t understand it you will undoubtedly make frustrating if not serious errors in the management of coverages.

Shapefiles

Shapefiles do not have the same kind of topology as coverages. The graphic features in shapefiles are just that, graphic objects, and, as you will see, the kinds of analyses that can be performed on shapefiles are quite different from the analyses that can be carried out with coverages. Figure 2.3 shows the data files that make up a shapefile called AccessRd. The file with the .dbf extension is the attribute data file, the .shp file stores the feature geometry, and the .shx file stores the index of the feature geometry. These are the three files that must be present for a working shapefile. Other files that may be in a shapefile are the .sbn and .sbx files that store the spatial index of the features and the .fbn and .fbx files that store the spatial index of the features for shapefiles that are read-only. There may also be .ain and .aih files that store the attribute index of the active fields in a table’s or a theme’s attribute table, the .xml file that stores metadata for use in ArcInfo 8, and the .avl file that stores legend information. For now the important point is that if you move a shapefile in DOS or Windows Explorer you have to make sure you get all the parts.

Figure 2.3. Contents of a simple shapefile.

Spatial Data Structure

The basic features in spatial data files can be points, lines (arcs; see footnote 2), or polygons (areas). There are other feature types that we will take up later. We will take up the structure of coverages first and then shapefiles. However, before we get into the structure of the two GI data structures we need to discuss topology.
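Returning briefly to file management: the warning above about getting all the parts of a shapefile can be sketched in Python. This is only an illustration under stated assumptions, not an ESRI tool. The function names (shapefile_parts, missing_required, move_shapefile) are hypothetical, the extension lists are simply the ones named in the text, and the sketch assumes the metadata file may be named either base.xml or base.shp.xml.

```python
import shutil
from pathlib import Path

# Extensions named in the text. The first three must be present.
REQUIRED = (".shp", ".shx", ".dbf")
OPTIONAL = (".sbn", ".sbx", ".fbn", ".fbx", ".ain", ".aih", ".xml", ".avl")


def shapefile_parts(shp_path):
    """Return the files on disk that belong to the shapefile at shp_path."""
    shp = Path(shp_path)
    base = shp.stem.lower()
    # Candidate names: base + extension, plus base.shp.xml (assumed metadata name).
    wanted = {base + ext for ext in REQUIRED + OPTIONAL}
    wanted.add(base + ".shp.xml")
    return sorted(p for p in shp.parent.iterdir()
                  if p.is_file() and p.name.lower() in wanted)


def missing_required(shp_path):
    """List the required extensions that have no matching file on disk."""
    present = {p.suffix.lower() for p in shapefile_parts(shp_path)}
    return [ext for ext in REQUIRED if ext not in present]


def move_shapefile(shp_path, dest_dir):
    """Move every part of the shapefile to dest_dir and return the new paths."""
    missing = missing_required(shp_path)
    if missing:
        raise FileNotFoundError(f"incomplete shapefile, missing: {missing}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.move(str(p), str(dest / p.name)))
            for p in shapefile_parts(shp_path)]
```

Calling move_shapefile on the path of the AccessRd .shp file would carry the .shx, .dbf and any optional companions along with it, and would refuse to move a shapefile that lacks one of the three required files. This applies to shapefiles only; coverages in a workspace must still be moved with the ESRI tools.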
Topology

Topology is a term that is very common in GIS literature and conversations. It is usually taken to mean that the data structure is designed in such a way that it is easy for the software to figure out what is next to what. However, that is NOT the only reason that some structures have topology. Coverages are usually spoken of as topological structures while shapefiles are not considered to be topological, and so we need to spend some time on the subject of topology.

Mathematical topology assumes that geographic objects are located on a 2-dimensional plane. These 2-D features are:

1. Nodes, or non-dimensional points defined by X,Y coordinates, sometimes (see footnote 3) called 0-dimensional cells
2. Edges, or arcs, or lines defined by two or more nodes, sometimes called 1-dimensional cells
3. Polygons, or areas, defined by 3 or more arcs and nodes, sometimes called 2-dimensional cells

For a topological database the points, lines, and areas exist in 2-D space. The rules for this space say that lines cannot cross without a node at the intersection, in which case the crossed lines become 3 or more lines, and polygons cannot overlap or have multiple parts. Note also that a polygon is a closed set of arcs containing a label point to which the polygon attributes are attached.

The major advantage of applying topological rules to a data set being constructed was that the software could enforce the rules and thereby help reduce errors in digitizing. Lines that intersected without a node being present, polygons that did not close, and overlapping polygons could be identified and corrected relatively easily. ESRI products contain at least 2 modules to do this work: BUILD and CLEAN. CLEAN is used to remove overshooting arcs and create nodes where arcs intersect, while BUILD actually builds the topological structure. Any coverage that contains a pat and/or an aat file has been built, but that is not proof that the errors have been corrected.
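One of the rules above, that lines cannot cross without a node at the intersection, can be checked mechanically. The following Python sketch is a minimal illustration and is not how CLEAN is implemented; the function names are hypothetical and each arc is assumed to be an ordered list of (x, y) tuples. It flags only proper crossings, where two segments cut through each other's interiors. An arc that merely touches another arc, or overlaps it along a collinear stretch, is not detected.

```python
from itertools import product


def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def segments_cross(a, b, c, d):
    """True when segment ab and segment cd cross at a point interior to both."""
    return (orientation(a, b, c) * orientation(a, b, d) < 0
            and orientation(c, d, a) * orientation(c, d, b) < 0)


def arcs_cross_without_node(arc1, arc2):
    """True when any segment of arc1 properly crosses any segment of arc2."""
    segments1 = list(zip(arc1, arc1[1:]))
    segments2 = list(zip(arc2, arc2[1:]))
    return any(segments_cross(a, b, c, d)
               for (a, b), (c, d) in product(segments1, segments2))


# Two arcs forming an X have no shared node at the crossing: a topological error.
print(arcs_cross_without_node([(0, 0), (4, 4)], [(0, 4), (4, 0)]))  # True
# Two arcs that meet end to end at (2, 2) share a node there: no error.
print(arcs_cross_without_node([(0, 0), (2, 2)], [(2, 2), (4, 0)]))  # False
```

A crossing flagged this way is the kind of spot where a node would have to be created, splitting the crossed lines into more arcs.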
Footnote 2: You might as well get used to the fact that the terms arc and line are used interchangeably: they mean the same thing and you will find both terms being used in the GIS community. The same is true for polygon and area; they mean the same thing.
Footnote 3: In the U.S. Census data descriptions.

Having a clean topological structure also allowed for relatively easy identification of polygons that were adjacent to one another, and that was important for analysis. It is not necessary for analysis, however, as you will see.

Coverage structure

In coverages, the basic data element is the point. Points are defined spatially by X,Y coordinate pairs. X and Y are usually in real-world coordinates like longitude and latitude but may be in table inches or any other coordinate system. In arc and polygon coverages there are two types of points: nodes and vertices. Nodes mark the beginning and ends of arcs. Nodes that connect only two arcs are called pseudonodes and may be errors. Vertices are used to shape the arc and are not connected to any other arcs. Lines (arcs) are built from points and polygons (areas) are built from lines.

Figure 2.4. Showing graphic structure of a polygon coverage. (Legend: N = node, V = vertex, letters A to D = polygons, numbers 1 to 6 = arcs; the X and Y axes both run from 0 to 7.)

Figure 2.4 shows a simple coverage that contains 4 polygons, 6 arcs, 5 nodes and 6 vertices. The X and Y coordinates locate the structure in space. Note that each arc has a direction, as shown by the arrow, and that individual arcs may have 1 or more segments and 0 or more vertices. Table 2.1 shows the node and vertex coordinates making up each arc in Figure 2.4, while Table 2.2 shows the topological structure.

Table 2.1. List of nodes and vertices and their coordinates making up each arc in Figure 2.4

Arc #   List of Nodes and Vertices
1       N1@4,6   V1@2,6   N2@1,3
2       N2@1,3   V2@3,2   N3@5,3
3       N3@5,3   N4@5,5
4       N1@4,6   N4@5,5
5       N5@2,4   N5@2,4
6       N1@4,6   V6@6,7   V5@7,6   V4@6,5   V3@6,4   N3@5,3

Table 2.2. Table showing the topological structure of the coverage in Figure 2.4

Arc #   From Node   To Node   Left Poly   Right Poly
1       N1          N2        B           A
2       N2          N3        B           A
3       N3          N4        B           D
4       N1          N4        D           B
5       N5          N5        B           C
6       N1          N3        A           D

There are several important features of this structure. The most important are its topological properties. Contiguity is maintained through the fact that each arc has direction and thus the polygons on the right and left of the arc can be determined. This means that the system “knows”, for example, that polygons B and D are next to each other across arcs 3 and 4. This fact is useful in carrying out analyses dependent on knowing what polygons are adjacent to one another. Connectivity is maintained because arcs connect to each other at nodes.

Another feature of the structure is that there are no duplicate arcs between polygons. In some GISs this is not true and polygons B and D would have completely closed arcs or rings of their own. This means that the arcs between the two polygons might NOT be the same and some error could be introduced. The structure also shows that arcs can be simple arcs, that is, straight lines between nodes, or they can be more complex structures using vertices to define curved or non-straight arcs. Still another feature is that the polygons must have labels or else there is no way to identify right and left polygons for the arcs.

Arc 5 has a special node, N5, called a pseudonode. A pseudonode is any node that has only two arcs connected to it. In the case of a closed polygon like C, whose single bounding arc starts and ends at N5, this is not an error. A pseudonode is considered an error when it just connects two arcs and there is no other reason for it to be there.

The coverage could have been an arc coverage.
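The contiguity and connectivity properties described above can be read directly out of Table 2.2. The Python sketch below is only an illustration of the idea, not the way ArcInfo stores or queries a coverage; the dictionary layout and the function names are assumptions made for this example, while the arc records themselves are copied from Table 2.2.

```python
from collections import defaultdict

# Table 2.2: arc number -> (from node, to node, left polygon, right polygon)
ARCS = {
    1: ("N1", "N2", "B", "A"),
    2: ("N2", "N3", "B", "A"),
    3: ("N3", "N4", "B", "D"),
    4: ("N1", "N4", "D", "B"),
    5: ("N5", "N5", "B", "C"),
    6: ("N1", "N3", "A", "D"),
}


def polygon_adjacency(arcs):
    """Contiguity: map each polygon to {neighbor: [arcs shared with that neighbor]}."""
    adjacency = defaultdict(lambda: defaultdict(list))
    for number, (_, _, left, right) in sorted(arcs.items()):
        if left != right:
            adjacency[left][right].append(number)
            adjacency[right][left].append(number)
    return {poly: dict(neighbors) for poly, neighbors in adjacency.items()}


def arcs_at_nodes(arcs):
    """Connectivity: map each node to the arcs that start or end there."""
    connections = defaultdict(list)
    for number, (from_node, to_node, _, _) in sorted(arcs.items()):
        connections[from_node].append(number)
        if to_node != from_node:  # a closed arc such as arc 5 is listed once
            connections[to_node].append(number)
    return dict(connections)


adjacency = polygon_adjacency(ARCS)
print(adjacency["B"]["D"])        # [3, 4]
print(sorted(adjacency["B"]))     # ['A', 'C', 'D']
print(arcs_at_nodes(ARCS)["N1"])  # [1, 4, 6]
```

The first printed line reproduces the statement in the text that polygons B and D are next to each other across arcs 3 and 4. As the text goes on to note, this kind of lookup relies on Table 2.2, which exists for a polygon coverage but not for an arc coverage.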
In this case Table 2.2 would not exist since all that is needed is the list of nodes and vertices for each arc. If there were no arcs then the coverage would have been a point coverage and the only information stored would be the coordinates of each point. Note that since each polygon in a polygon coverage must have a label point it is impermissible to have both points and polygons in the same coverage. Although it is possible to have both arcs and points in one coverage, good practice says to keep the different coverage types separate.

Shapefile structure

Like coverages, a set of ArcView files can represent points, lines, or polygons (and regions also – see later). Figure 2.6 shows a collection of arcs and their nodes as they would be constructed in a shapefile. These are called polylines. The polylines are defined by the ordered sequence of nodes making up the feature. The feature at A is a polyline with 3 connected components or parts but has only one record in the attribute table. However, there is no rule that says polyline features have to be connected. Thus, even if the arcs in Figure 2.6 were not connected they could still be a single arc as far as ArcView is concerned and be represented as one record in the attribute table. Hence the term polyline is used.

Figure 2.6. Polyline features in a shapefile.

Polygon structures in shapefiles are a little more complex. There can be multiple parts to a polygon shape, just as for arcs in an arc shape. Figure 2.7 shows a typical polygon theme with vertices identified by number. Technically the polygons are rings and the lines connecting the vertices always go in a clockwise direction. Look at the polygon attribute file for the polygons A, B, and C. Note that for polygon A the vertices start at 1 and then go clockwise to 14, 13, … and end up at 1 again. Since the arcs defining the polygons always go clockwise, the polygon is always to the right of the bounding arcs. This is a fairly complex polygon theme because it shows both an island polygon (C) and a polygon (B) with a hole in it.

Figure 2.7. Structure of a polygon shapefile showing the polygons, the vertices, and the attribute files for both.

Although it appears that polygons A and B have a common boundary, they in fact do not, as is shown in Figure 2.8. In this figure, we have dragged polygon B away from A so that you can see that each polygon is complete. There are no shared boundaries in this case. The theme was constructed using the autocomplete function in ArcView, which forces the vertices that appear to be common in Figure 2.7 to be duplicated for polygon B. Each of the polygons is complete; there are NO common edges!

Figure 2.8. ArcView polygons are complete with all boundaries.

With shapefiles it is also possible to have polygons with multiple parts. Figure 2.9 shows 3 polygons, 1, 2, and 3. Polygons 2 and 3 have had a strip erased so that these 2 polygons appear to be 4 polygons. However, the attribute table shows that as far as ArcView is concerned there are still only 3 polygons in the view. Polygons with multiple parts can be created during analysis, and although this can sometimes be a problem, it reflects reality when dealing with tax parcels, for example, that have been split by a new road right of way.

Figure 2.9. What appear to be 5 polygons are, as far as ArcView is concerned, actually only 3.

Although shapefiles do not have topology, they can still be used in analysis because the software can compute the necessary relationships on the fly. With modern computers, this is not a problem since computational speeds are very high. Because of the relatively simple data structure, shapefiles draw more rapidly than do coverages.

E00 AND OTHER EXCHANGE FILES

In order to move data between different systems ESRI has what are called Exchange files, commonly referred to as E00 files because that is the extension on the files.
These files are ASCII (text) files and can be read by most ESRI software regardless of platform (Windows, Unix, etc.). They are disk space hogs, however. With today’s software shapefiles are a better alternative but cannot be read by all software on all systems.

TERMINOLOGY REVIEW

Shapefile: A drawing file without topology (in the usual sense) designed for use with ArcView and ArcGIS
Coverage: An ESRI vector data structure with topology
Directory: AKA Windows folder
E00 file: ArcInfo ASCII exchange file
Folder: AKA directory
Node: A point that is used to start and stop arcs
Polygon: An arc structure that closes on itself
TIN: Triangulated Irregular Network – used to model continuous surfaces
Topology: Explicit spatial relationships between features
Vertex: A point that is used to control the shape of an arc
Workspace: A directory (folder) that contains an INFO folder and is used to hold geographic data pertaining to some project or set of data.

SUMMARY

There are two aspects of GIS data files that are important to the efficient use of the technology: the actual data model or structure of the GI data files and their file structure. The important aspects of file structure for coverages and shapefiles are as follows:

1. Coverages are folders (or directories if you prefer). If the coverage is other than a PC ArcInfo coverage then it is located in a workspace. Having an INFO folder identifies a workspace. Coverages in a workspace MUST be copied or moved using the management tools in the software and must NEVER be copied or moved by DOS commands or Windows click and drag techniques. PC ArcInfo coverages are also folders but can be moved with normal DOS or Windows commands.
2. Coverages are topological databases and have a complex data structure. They are folders (directories) containing a set of files that, for the most part, contain pointers to data stored in the INFO directory within the workspace. Only whole workspaces can be moved with Windows or DOS commands.
You must use ArcInfo data management tools or ArcCatalog to move individual coverages.
3. Shapefiles are simpler than coverages and can be moved through the use of DOS and Windows techniques, but you have to be careful that you get all of the files describing a shape. Shapefiles are not topological.
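The workspace rule in points 1 and 2 can be turned into a quick check before moving anything by hand. The Python sketch below is a hedged illustration, not part of any ESRI software: the function names are hypothetical, the check looks only at the immediate parent folder, and it assumes the INFO folder may be spelled in upper or lower case.

```python
from pathlib import Path


def is_workspace(folder):
    """A folder counts as a workspace when it directly contains an INFO folder."""
    folder = Path(folder)
    return folder.is_dir() and any(
        child.is_dir() and child.name.lower() == "info"
        for child in folder.iterdir()
    )


def safe_to_move_with_explorer(folder):
    """False for any folder that sits directly inside a workspace.

    A whole workspace, or a PC ArcInfo coverage that has no INFO folder
    beside it, passes this check.
    """
    folder = Path(folder).resolve()
    return not is_workspace(folder.parent)
```

For the example in this chapter, safe_to_move_with_explorer would return False for the INLANBUF coverage folder inside the LANDED workspace, which means INLANBUF should be moved only with the ArcInfo data management tools or ArcCatalog.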