Data Structures

Chapter 2
Data Structures
VECTOR DATA STRUCTURES
There are numerous spatial data structures or data models used in GIS software. Figure
1.1 diagrams the relationships between common data models. We are primarily
concerned with two data structures: coverages and shapefiles. There are two aspects of
the structures: the folder and file structure, that is, how the files that contain the data are
stored on disk, and the actual data model, that is, the organization of both the spatial and
the non-spatial (attribute) data.
Folder and File structure
There are two primary spatial data file structures we need to discuss: the disk structures of
the two file types we are going to be using in this course, coverages and shapefiles.
It is very important that you understand the differences in
how these data files are stored on disk because incorrect file management is a major
problem for beginning users. Although we are going to be using only
coverages, shapefiles, and several kinds of attribute data files, there are many other
forms of GIS data, as outlined in Figure 1.1. The vector data model, as described in
Chapter 1, is based on points and their X,Y coordinate pairs. These points are used to
construct the other feature types (lines and polygons). Under Vector models, the figure
shows topological and non-topological models. Shapefiles are considered to be
non-topological. It might be better to say that shapefiles are designed so that topology can be
constructed on the fly and is not built into the data. As a result, shapefiles load and draw
faster than coverages. Coverages are the topological data model that we will be using.
TINs are triangulated irregular networks used to represent
continuous data like elevation and pollution concentrations. TINs are made up of
triangles whose edges connect three points defining, say, elevation at spots. The slope
and aspect of the triangular polygons can be calculated easily and other raster-type
operations can be carried out with this data structure. Regions are an advanced polygon
structure in which the polygons may overlap and the logical polygon may be made up of
several unconnected graphic polygons. Dynamic segmentation is a model based on the
line or arc feature type and allows for the assignment of different attributes to different
parts of an arc. The GeoDatabase is a complex structure that uses point, line, and
polygon geometries to represent the graphic part of GI. Point features can consist of a
single point or a set of points. A line feature is made up of a set of arcs that are not
necessarily connected. A polygon feature is made up of one or many rings that may or
may not be connected. While the other structures store the attribute data separately
from the spatial data, that is not the case with the GeoDatabase: both spatial and attribute
data are stored in the same database.
Figure 1.1. A schematic of data models. After Chang, 2002. (Diagram labels: GI; Spatial Data;
Attribute Data; Vector; Raster; Grid; IDRISI; NonTopological; Topological; Shapefile; Coverage;
High level Data Models; Simple Data; TIN; GeoDataBase; Regions; Dynamic segmentation;
Object Oriented; Access; DBase; Other DBs.)
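To make the remark about TIN slope and aspect concrete, here is a minimal sketch for a single triangle. The function name triangle_slope_aspect is hypothetical, and the conventions (X is east, Y is north, aspect is the compass bearing of the downslope direction) are assumptions made for this example; they are not taken from any particular GIS package.

```python
import math

def triangle_slope_aspect(p1, p2, p3):
    """Return (slope_deg, aspect_deg) for one TIN triangle.

    Each point is an (x, y, z) tuple. Assumes x = east and y = north.
    Aspect is the compass bearing (clockwise from north) of the downslope
    direction; it is None when the triangle is flat.
    """
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    # Normal vector of the triangle's plane (cross product u x v).
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    if nz == 0:
        raise ValueError("triangle is vertical or degenerate in plan view")
    if nz < 0:  # flip so the normal points upward
        nx, ny, nz = -nx, -ny, -nz
    horizontal = math.hypot(nx, ny)
    # Slope is the angle between the upward normal and the vertical.
    slope = math.degrees(math.atan2(horizontal, nz))
    if horizontal == 0:
        return slope, None
    # The horizontal part of the upward normal points downslope.
    aspect = math.degrees(math.atan2(nx, ny)) % 360.0
    return slope, aspect

# A triangle that rises toward the east, so it faces west.
slope, aspect = triangle_slope_aspect((0, 0, 0), (1, 0, 1), (0, 1, 0))
```

For this triangle the slope is 45 degrees and the aspect is 270 degrees (west).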
The ways in which coverage and shapefile data are stored are quite different, which can cause
considerable confusion for novice users. We will take up the file structure of each data
model separately.
Coverages
A coverage is a topological data structure (see later for definition of topological). On
disk, a coverage is defined by the data stored in a folder. Figure 2.2 shows the coverages
for a database called Yellowstone. Note that the Yellowstone database includes a
folder called info as well as coverages for various versions of land use. The info folder is
important since that is where the data is actually stored, but the really important point
from a data management point of view is that any folder of GIS data that contains an info
folder is a workspace. Any folders inside a workspace, that is, any folder containing an
info folder, cannot be moved by dragging or by copying and pasting! The files inside the
coverages actually contain only pointers to data in the info folder, so moving them in
Explorer or in DOS will destroy the coverage. You must use the data management tools in
the ESRI software. In Figure 2.2, the coverage is called INLANBUF and it is located in a
workspace called LANDED, although you can’t tell that from the figure. You can tell it is a
workspace from the fact that there is an INFO folder.
Figure 2.2. Windows Explorer view of the database called Yellowstone.
The contents of some of the files in the coverage INLANBUF are:
pat.adf – Polygon or point attribute table. Contains information about the polygons (or points if it is a point coverage).
aat.adf – Not in Figure 2.2 but, if present, contains the arc attribute table.
tic.adf – Contains the tic coordinates for the coverage. Tics locate the coverage in space.
bnd.adf – Contains the coordinates of the current bounding rectangle for the coverage. The bounding rectangle is also called the extent of the data.
prj.adf – Contains projection, datum, and units information about the coverage.
arc.adf – Contains information about the arcs in the coverage.
If the coverage were a PC ArcInfo coverage, then files with names like pat.dbf and
tic.dbf would be, respectively, the actual polygon and tic database files and NOT
pointers to data in an info folder. Since the files in workstation ArcInfo have an .adf
extension, you know they contain only pointers to the data stored in the info folder.
Coverages produced and used by PC ArcInfo do NOT have the info folder structure and
thus may be moved or copied and pasted in Windows Explorer or DOS.
Make sure you understand the points made above about the folder and file structure of
coverages. If you don’t understand it you will
undoubtedly make frustrating if not serious
errors in the management of coverages.
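The workspace rule can be turned into a quick check before you touch a coverage folder. The following is a minimal sketch: the function names is_workspace and safe_to_move_with_explorer are hypothetical, and the only test applied is the one given in the text, namely whether the parent folder contains a folder called info. It does not inspect the coverage files themselves.

```python
from pathlib import Path

def is_workspace(folder):
    """True if `folder` directly contains a subfolder named info (any case)."""
    folder = Path(folder)
    return folder.is_dir() and any(
        child.is_dir() and child.name.lower() == "info"
        for child in folder.iterdir()
    )

def safe_to_move_with_explorer(coverage_folder):
    """A coverage that sits inside a workspace must be moved with the GIS tools.

    Returns False when the folder holding the coverage is a workspace.
    """
    parent = Path(coverage_folder).resolve().parent
    return not is_workspace(parent)
```

A call such as safe_to_move_with_explorer("LANDED/INLANBUF") would return False for the workspace described above, because LANDED contains an info folder.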
Shapefiles
Shapefiles do not have the same kind of topology as coverages. The graphic features in
shapefiles are just that, graphic objects, and, as you will see, the kinds of analyses that
can be performed on shapefiles are quite different from the analyses that can be carried
out with coverages. Figure 2.3 shows the data files that
make up a shapefile called AccessRd. The file with the .dbf extension is the attribute
data file, the .shp file stores the geometry, and the .shx file stores the index of the
feature geometry. These are the three files that must be present for a working shapefile.
Other files that may be in a shapefile are the .sbn and .sbx files that store the spatial index
of the features and the .fbn and .fbx files that store the spatial index of the features for
shapefiles that are read-only. There may also be .ain and .aih files that store the attribute
index of the active fields in a table’s or a theme’s attribute table, the .xml file that stores
metadata for use in ArcInfo 8, and the .avl file that stores legend information. For now
the important point is that if you move a shapefile in DOS or Windows Explorer you
have to make sure you get all the parts.
Figure 2.3. Contents of a simple shapefile.
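One way to make sure you get all the parts is to gather every file that shares the shapefile's base name before copying or moving anything. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the base name itself contains no dots and that all parts sit in the same folder.

```python
from pathlib import Path

# The three files the text says must be present for a working shapefile.
REQUIRED = {".shp", ".shx", ".dbf"}

def base_name(path):
    """Text before the first dot, lower-cased (assumes no dots in the name)."""
    return Path(path).name.split(".", 1)[0].lower()

def shapefile_parts(shp_path):
    """Return every file in the same folder that shares the base name."""
    shp_path = Path(shp_path)
    wanted = base_name(shp_path)
    return sorted(
        p for p in shp_path.parent.iterdir()
        if p.is_file() and base_name(p) == wanted
    )

def missing_required(shp_path):
    """Return the required extensions that are absent for this shapefile."""
    present = {p.suffix.lower() for p in shapefile_parts(shp_path)}
    return sorted(REQUIRED - present)
```

Calling shapefile_parts on the path to AccessRd.shp would list the .shp, .shx, .dbf, and any index or legend files beside it, and missing_required returns an empty list when the three mandatory files are all there.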
Spatial Data Structure
The basic features in spatial data files can be points, lines (arcs) [2], or polygons (areas).
There are other feature types that we will take up later. We will take up the structure of
coverages first and then shapefiles. However, before we get into the structure of the two
GI data structures we need to discuss topology.
Topology
Topology is a term that is very common in GIS literature and conversations. It is usually
taken to mean that the data structure is designed in such a way that it is easy for the
software to figure out what is next to what. However, that is NOT the only reason that
some structures have topology. Coverages are usually spoken of as topological structures
while shapefiles are not considered to be topological and so we need to spend some time
on the subject of topology.
Mathematical topology assumes that geographic objects are located on a 2-dimensional
plane. These 2-D features are:
1. Nodes or non-dimensional points defined by X,Y coordinates, sometimes [3] called
0-dimensional cells
2. Edges, arcs, or lines defined by two or more nodes, sometimes called 1-dimensional cells
3. Polygons or areas defined by 3 or more arcs and nodes, sometimes called 2-dimensional cells
For a topological database the points, lines, and areas exist in 2-D space. The rules for
this space say that lines cannot cross without a node at the intersection, in which case the
crossed lines become 3 or more lines, and polygons cannot overlap or have multiple
parts. Note also that a polygon is a closed set of arcs containing a label point to which
the polygon attributes are attached.
The major advantage of applying topological rules to a data set being constructed was
that the software could enforce the rules and thereby help reduce errors in
digitizing. Lines that intersected without a node being present, polygons that did not close,
and overlapping polygons could be identified and corrected relatively easily. ESRI
products contain at least 2 modules to do this work: BUILD and CLEAN. CLEAN is
used to remove overshooting arcs and create nodes where arcs intersect while BUILD
actually builds the topological structure. Any coverage that contains a pat and/or an aat
file has been built, but that is not proof that the errors have been corrected.
[2] You might as well get used to the fact that the terms arc and line are used interchangeably: they mean the
same thing and you will find both terms being used in the GIS community. The same is true for polygon
and area; they mean the same thing.
[3] In the U.S. Census data descriptions.
Having a clean topological structure also allowed for relatively easy identification of
polygons that were adjacent to one another, which was important for analysis although,
as you will see, not strictly necessary for it.
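The rule that lines cannot cross without a node at the intersection can be checked for two straight segments with a standard orientation test. This is only a sketch of the idea, not how CLEAN works internally; the function names are hypothetical, and collinear overlaps or an endpoint that merely touches the middle of another segment are deliberately not handled.

```python
def orientation(p, q, r):
    """Sign of the turn p -> q -> r: >0 counterclockwise, <0 clockwise, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def cross_without_node(seg_a, seg_b):
    """True if two segments cross at a point that is interior to both.

    Each segment is a pair of (x, y) endpoints. Segments that only meet
    at a shared endpoint (a node) are not reported.
    """
    a1, a2 = seg_a
    b1, b2 = seg_b
    d1 = orientation(a1, a2, b1)
    d2 = orientation(a1, a2, b2)
    d3 = orientation(b1, b2, a1)
    d4 = orientation(b1, b2, a2)
    # Each segment's endpoints must lie strictly on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0

# Two diagonals of a square cross in the middle with no node there.
crossing = cross_without_node(((0, 0), (2, 2)), ((0, 2), (2, 0)))
# Two segments that meet at a shared end point are fine.
touching = cross_without_node(((0, 0), (1, 1)), ((1, 1), (2, 0)))
```

Here crossing is True and touching is False.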
Coverage structure
In coverages, the basic data is points. Points are defined spatially by X, Y coordinate
pairs. X and Y are usually in real world coordinates like longitude and latitude but may be
in table inches or any other coordinate system. In arc and polygon coverages there are two
types of points: nodes and vertices. Nodes mark the beginning and ends of arcs. Nodes at
which only two arcs meet, or at which a single arc closes on itself, are called pseudonodes
and may be errors. Vertices are used to shape the arc and are not connected to any other
arcs. Lines (arcs) are built from points and polygons (areas) are built from lines.
Figure 2.4 shows a simple coverage that contains 4 polygons, 6 arcs, 5 nodes and 6 vertices.
The X and Y coordinates locate the structure in space. Note that each arc has a direction
as shown by the arrow and that individual arcs may have 1 or more segments and 0 or more
vertices.
Figure 2.4. Showing graphic structure of a polygon coverage. (Legend: N = Node, V = Vertex,
A = Polygon, 1 = Arc #; the X and Y axes both run from 0 to 7.)
Table 2.1 shows the node and vertex coordinates making up each arc in Figure 2.4 while
Table 2.2 shows the topological structure.
Table 2.1. List of Nodes and Vertices and their coordinates making up each arc in Figure 2.4
Arc #   List of Nodes and Vertices
1       N1@4,6  V1@2,6  N2@1,3
2       N2@1,3  V2@3,2  N3@5,3
3       N3@5,3  N4@5,5
4       N1@4,6  N4@5,5
5       N5@2,4  N5@2,4
6       N1@4,6  V6@6,7  V5@7,6  V4@6,5  V3@6,4  N3@5,3
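The arc list in Table 2.1 can be held as a plain data structure, which makes it easy to confirm the node and vertex counts quoted for Figure 2.4. The layout below (a dictionary of ordered point lists) is just one possible representation chosen for this sketch; it is not the internal format of arc.adf, and the coordinates are as read from the table.

```python
# Arcs from Table 2.1: each arc is an ordered list of (label, x, y) points.
ARCS = {
    1: [("N1", 4, 6), ("V1", 2, 6), ("N2", 1, 3)],
    2: [("N2", 1, 3), ("V2", 3, 2), ("N3", 5, 3)],
    3: [("N3", 5, 3), ("N4", 5, 5)],
    4: [("N1", 4, 6), ("N4", 5, 5)],
    5: [("N5", 2, 4), ("N5", 2, 4)],
    6: [("N1", 4, 6), ("V6", 6, 7), ("V5", 7, 6),
        ("V4", 6, 5), ("V3", 6, 4), ("N3", 5, 3)],
}

def end_nodes(arc):
    """The first and last points of an arc are its from-node and to-node."""
    return arc[0][0], arc[-1][0]

def vertices(arc):
    """Every point between the two end nodes is a shape vertex."""
    return [label for label, _x, _y in arc[1:-1]]

all_nodes = {label for arc in ARCS.values() for label in end_nodes(arc)}
all_vertices = {label for arc in ARCS.values() for label in vertices(arc)}
```

With this data, all_nodes holds 5 labels and all_vertices holds 6, matching the counts given for the coverage.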
Table 2.2. Table showing the topological structure of the coverage in Figure 2.4
Arc #   From Node   To Node   Left Poly   Right Poly
1       N1          N2        B           A
2       N2          N3        B           A
3       N3          N4        B           D
4       N1          N4        D           B
5       N5          N5        B           C
6       N1          N3        A           D
There are several important features of this structure. The most important are its
topological properties. Contiguity is maintained through the fact that each arc has
direction and thus the polygons on the right and left of the arc can be determined. This
means that the system “knows”, for example, that polygons B and D are next to each
other across arcs 3 and 4. This fact is useful in carrying out analyses dependent on
knowing what polygons are adjacent to one another. Connectivity is maintained because
arcs connect to each other at nodes. Another feature of the structure is that there are no
duplicate arcs between polygons. In some GISs this is not true and polygons B and D
would have completely closed arcs or rings of their own. This means that the arcs
between the two polygons might NOT be the same and some error could be introduced.
The structure also shows that arcs can be simple arcs, that is, straight lines between nodes,
or they can be more complex structures using vertices to define curved or non-straight
arcs. Still another feature is that the polygons must have labels or else there is no way to
identify right and left polygons for the arcs. Arc 5 has a special node, N5, called a
pseudonode. A pseudonode is any node that has only two arcs connected to it or, as here,
a node at which a single arc begins and ends. In the case of a closed polygon like C this
is not an error. A pseudonode is considered an error
when it just connects two arcs and there is no other reason for it to be there.
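Contiguity and connectivity can both be read straight out of Table 2.2. The following sketch stores the table as a dictionary and derives which polygons are neighbours and which arcs meet at each node. The names TOPOLOGY, shared_arcs, and arcs_at_node are hypothetical and chosen for this illustration; real coverages keep this information in their own files.

```python
from collections import defaultdict

# Table 2.2: arc -> (from_node, to_node, left_poly, right_poly)
TOPOLOGY = {
    1: ("N1", "N2", "B", "A"),
    2: ("N2", "N3", "B", "A"),
    3: ("N3", "N4", "B", "D"),
    4: ("N1", "N4", "D", "B"),
    5: ("N5", "N5", "B", "C"),
    6: ("N1", "N3", "A", "D"),
}

def shared_arcs(topology):
    """Contiguity: map each pair of neighbouring polygons to the arcs between them."""
    pairs = defaultdict(set)
    for arc, (_from, _to, left, right) in topology.items():
        pairs[frozenset((left, right))].add(arc)
    return dict(pairs)

def arcs_at_node(topology):
    """Connectivity: map each node to the arcs that start or end there."""
    nodes = defaultdict(set)
    for arc, (from_node, to_node, _left, _right) in topology.items():
        nodes[from_node].add(arc)
        nodes[to_node].add(arc)
    return dict(nodes)

neighbours = shared_arcs(TOPOLOGY)
b_and_d = neighbours[frozenset(("B", "D"))]
```

Here b_and_d is the set {3, 4}, which is the statement in the text that polygons B and D are next to each other across arcs 3 and 4. No coordinates were needed to find that out, which is the point of storing topology explicitly.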
The coverage could have been an arc coverage. In this case Table 2.2 would not exist
since all that is needed is the list of nodes and vertices for each arc. If there were no arcs
then the coverage would have been a point coverage and the only information stored
would be the coordinates of each point.
Note that since each polygon in a polygon coverage must have a label point it is
impermissible to have both points and polygons in the same coverage. Although it is
possible to have both arcs and points in one coverage good practice says to keep the
different coverage types separate.
Shapefile structure
Like coverages, a set of ArcView files can represent points, lines, or polygons (and
regions also; see later). Figure 2.6 shows a collection of arcs and their nodes as they
would be constructed in a shapefile. These are called polylines. The polylines are defined
by the ordered sequence of nodes making up the feature. The feature at A is a polyline
with 3 connected components or parts but has only one record in the attribute table.
However, there is no rule that says polyline features have to be connected. Thus, even
if the arcs in Figure 2.6 were not connected they could still be a single feature as far as
ArcView is concerned and be represented as one record in the attribute table. Hence the
term polyline is used.
Figure 2.6. Polyline features in a shapefile.
Polygon structures in shapefiles are a little more complex. There can be multiple parts to
a polygon shape just like for arcs in an arc shape. Figure 2.7 shows a typical polygon theme
with vertices identified by number. Technically the polygons are rings and the lines
connecting the vertices always go in a clockwise direction. Look at the polygon attribute
file for the polygons A, B, and C. Note that for polygon A the vertices start at 1 and then
go clockwise to 14, 13, … and end up at 1 again. Since the arcs defining the polygons
always go clockwise the polygon is always to the right of the bounding arcs. This is a
fairly complex polygon theme because it shows an island polygon (C) and because polygon B
has a hole in it. Although it appears that polygons A and B have a common boundary, they,
in fact, do not, as is shown in Figure 2.8. In this figure, we have dragged polygon B away
from A so that you can see that each polygon is complete. There are no shared boundaries in
this case. The theme was constructed using the autocomplete function in ArcView that forces
the vertices that appear to be common in Figure 2.7 to be duplicated for polygon B. Each
of the polygons is complete; there are NO common edges! With shapefiles it is also
possible to have polygons with multiple parts. Figure 2.9 shows 3 polygons, 1, 2, and 3.
Polygons 2 and 3 have had a strip erased so that these 2 polygons appear to be 4
polygons. However, the attribute table shows that as far as ArcView is concerned there
are still only 3 polygons in the view. Polygons with multiple parts can be created during
analysis, and although this can sometimes be a problem it is genuinely useful when dealing
with tax parcels, for example, that have been split by a new road right of way.
Figure 2.7. Structure of a polygon shapefile showing the polygons, the vertices, and the
attribute files for both.
Figure 2.8. ArcView polygons are complete with all boundaries.
Figure 2.9. What appear to be 5 polygons are, as far as ArcView is concerned, actually only 3.
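Whether a ring of vertices runs clockwise can be checked with the shoelace (signed area) formula. This is a minimal sketch rather than anything ArcView exposes: the function names are hypothetical, a ring is assumed to be a list of (x, y) tuples, and Y is assumed to increase upward as on a map.

```python
def signed_area(ring):
    """Shoelace formula: positive for counterclockwise rings, negative for clockwise."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0

def is_clockwise(ring):
    """True when the vertices are listed in clockwise order."""
    return signed_area(ring) < 0

# A unit square listed up, right, down: a clockwise ring.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
clockwise = is_clockwise(square)
```

For this square, clockwise is True, and reversing the vertex list makes it False. Repeating the first vertex at the end of the list, as closed rings often do, does not change the result.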
Although shapefiles do not have topology, they can still be used in analysis because the
software can compute the necessary relationships on the fly. With modern computers,
this is not a problem since computational speeds are very high. Because of the relatively
simple data structure shapefiles draw more rapidly than do coverages.
E00 AND OTHER EXCHANGE FILES
In order to move data between different systems ESRI has what are called Exchange
files, commonly referred to as E00 files because that is the extension on the files. These
files are ASCII (text) files and can be read by most ESRI software regardless of platform
(Windows, Unix, etc.). They are disk space hogs, however. With today’s software, shapefiles
are a better alternative, but they cannot be read by all software on all systems.
TERMINOLOGY REVIEW
Shapefile: A drawing file without topology (in the usual sense) designed for use with ArcView and ArcGIS
Coverage: An ESRI vector data structure with topology
Directory: AKA Windows folder
E00 file: ArcInfo ASCII exchange file
Folder: AKA directory
Node: A point that is used to start and stop arcs
Polygon: An arc structure that closes on itself
TIN: Triangulated Irregular Network, used to model continuous surfaces
Topology: Explicit spatial relationships between features
Vertex: A point that is used to control the shape of an arc
Workspace: A directory (folder) that contains an INFO folder and is used to hold geographic data pertaining to some project or set of data.
SUMMARY
There are two aspects of GIS data files that are important to the efficient use of the
technology: the actual data model or structure of the GI data files and their file structure.
The important aspects of file structure for coverages and shapefiles are as follows:
1. Coverages are folders (or directories if you prefer). If the coverage is other than a
PC ArcInfo coverage then it is located in a workspace. Having an INFO
folder identifies a workspace. Coverages in a workspace MUST be copied or
moved using the management tools in the software and must NEVER be copied
or moved by DOS commands or Windows click and drag techniques. PC ArcInfo coverages
are also folders but can be moved with normal DOS or Windows commands.
2. Coverages are topological databases and have a complex data structure. They are
folders (directories) containing a set of files that, for the most part, contain
pointers to data stored in the INFO directory within the workspace. Only whole
workspaces can be moved with Windows or DOS commands. You must use ArcInfo data
management tools or ArcCatalog to move individual coverages.
3. Shapefiles are simpler than coverages and can be moved through the use of DOS
and Windows techniques. But you have to be careful that you get all of the files
describing a shape. Shapefiles are not topological.