Databases, Spatial Databases, and PostGIS Overview of relational database concepts and PostGIS DBMS Perspective From Spatial Databases, A Tour, by Shekhar and Chawla Role of DBMS Typical small system architecture: Role of DBMS Larger spatial DBMS From “Introduction to Spatial Data Management with Postgis,” by Arnulf Christl http://www.ccgis.de, http://www.mapbender.org/ Definitions Database Management Systems (DBMS) – “Software that controls the organization, storage, retrieval, security and integrity of data in a database. It accepts requests from the application and instructs the operating system to transfer the appropriate data.” (Computer Desktop Encyclopedia, 2007) Key features: – – – – – Data Security Data Integrity Interactive Query Interactive Data Manipulation Data Independence May be Network, Hierarchical, Object, Relational. Relational is by far the most commonly-used and wellestablished, and handles most data management problems very well. Definitions Relational Database Management Systems (RDBMS)– “A database that maintains a set of separate, related files (tables), but combines data elements from the files for queries and reports when required. The concept was developed in 1970 by Edgar Codd, whose objective was to accommodate a user's ad hoc request for selected data.” [See Codd Article for details.] (Computer Desktop Encyclopedia, 2007) Data stored in separate tables, each containing tabular data like a spreadsheet, joined together as needed. In the early days of RDBMS, many vendors claimed to offer relational databases when they did not – Codd came up with 12 rules defining the requirements for a database system to be truly relational. Tables Relational tables have these properties: Column Values Are of the Same Kind Each Row is Unique The Sequence of Columns is not significant The Sequence of Rows is not significant Each column must have a unique name (within the table) Represents a single entity. Example: Columns Table columns have these properties: They have a data type (similar to variables, but slightly different) – char, varchar, number, float, text, BLOB (image), etc. Column values should be independent from each other. Values may be required (not null), or nullable. (Some databases differentiate between null and zero-length string). Columns may be indexed to improve access speed. A column may be used as the basis of the order of the physical data (clustered index). Each represents an independent attribute. Primary Keys The primary key is an attribute or a set of attributes that uniquely identify a specific instance of an entity. Every entity in the data model must have a primary key whose values uniquely identify instances of the entity. To qualify as a primary key for an entity, an attribute must have the following properties: – it must have a non-null value for each instance of the entity. – the value must be unique for each instance of an entity. – the values must not change or become null during the life of each entity instance Composite and Artificial Keys Sometimes more than one attribute is required to uniquely identify an entity. A primary key that made up of more than one attribute is known as a composite key. An artificial key is one that has no intrinsic meaning. These can be very useful when no attribute has all the primary key properties, or the meaning of the primary key is otherwise complicated or conditional. Foreign Keys A foreign key is an attribute that defines a relationship between two tables. Foreign keys provide a method for maintaining integrity in the data (referential integrity). Every relationship between entities must be supported by a foreign key. A foreign key is an attribute in one table and a primary key in another. Typically, an RDBMS will let you specify whether dependent entities are deleted (cascade delete) when a parent is deleted, or disallowed. This ensures the referential integrity of the database. Foreign keys make possible establishing relationships in an Entity-Relationship Diagram [see sample below]. Primary and Foreign Keys Example: http://msdn.microsoft.com/en-us/library/ms171971.aspx Indexes Disk reads are expensive. Finding a record in a large table can be slow, often requiring a full table scan, that is, checking every row in the table. Indexes allow for much more efficient searching. If the performance of a query is bad, most of the time it can be fixed with better indexes. Indexes tend to decrease query time, but increase insert/delete time, since they need to be adjusted (automatically) when the data changes. Indexes From Spatial Databases, A Tour, by Shekhar and Chawla Indexes Example of an index minimizing disk access: Indexes Spatial data doesn’t have a simple natural ordering like most traditional data types. In PostGIS, interacting with spatial indexes is similar to other data types, but internally is handled differently using the Generalized Search Tree algorithm. This allows flexible partitioning based on the kind of data being indexed, rather than on an existing alphanumeric sequence PostGIS has implemented spatial partitioning algorithms for GIST indexes. http://gist.cs.berkeley.edu/gist1.html Other Features of RDBMS Triggers Transaction Support Stored Procedures Views User-Defined Functions User-Defined Data Types Extended features: – Full-text search – Spatial data Replication Others… Normalization The most elementary structure for a database is a single table, which corresponds to a spreadsheet or attribute table. In this model there is one kind of entity (consisting of rows) which has a uniform set of attributes (consisting of columns). This is fast and simple, but very limited. Example of un-normalized data – NYC PLUTO tax lot data: Normalization While this might adequately capture the data we need, it has some drawbacks: – There will be lots of redundant data if many parcels are added. The borough and owner names, for example, will all be replicated for each row where these values are the same. This makes maintaining the data difficult. – What happens if a single parcel is zoned for more uses than the number of columns allow? Or if one parcel has several owners? – How do you know for sure which parcel is which? Normalization The solution is to begin normalizing the data. You can look at this as a three-step process implementing each of Codd’s “Normal Forms”. First normal form rules: – A row of data cannot contain repeating groups of similar data (atomicity); and – Each row of data must have a unique identifier or primary key. Second normal form rules: – No attribute can be dependent on only a portion of the primary key. – Every column must depend only on the entire primary key; if it is dependent on one or more other columns, these should be moved into new tables. Third normal form: – No dependencies within non-key attributes. Normalization Normalization usually involves creating multiple tables, each of which complies with the three normal forms. A normalized version of the PLUTO data, for example, would likely have separate zoning and owner tables. However, in practice it is often useful to de-normalize the database structure. The PLUTO data in its current form is extremely usable, and de-normalizing it for usability makes sense. Typically reporting applications will use some form of denormalization to improve query performance, while more transactional applications will use normalized database structures to minimize redundant updates. SQL Structured Query Language is the interface you use to communicate with an RDBMS. It consists of Queries; DML – data manipulation language; and DDL – data definition language. SQL is standardized, but different DB vendors have different flavors and extensions (Oracle Spatial, for example, adds spatial keywords to SQL). SQL is not really a full featured language like C#, although most database vendors have SQL-based languages like PL/SQL or TSQL that let you embed SQL statements directly in procedural code. Typically, a program will interact with a database by submitting SQL statements, one by one, to a database using a data access layer that sends the requests to the database and returns the results to the program. SQL DDL statements - examples: CREATE TABLE - creates a new database table ALTER TABLE - changes a database table’s structure DROP TABLE - deletes a database table CREATE INDEX - creates an index DROP INDEX - deletes an index SQL Queries: SELECT – extracts data from a database table DML statements: UPDATE - updates data in a database table DELETE - deletes data from a database table INSERT - inserts new data into a database table SQL Return all the rows in a table: SELECT * from TableName Order by a field: SELECT * from TableName ORDER BY ColumnName Add a where clause: SELECT * from TableName WHERE ColumnName < 5 ORDER BY ColumnName Joining two tables: SELECT Table1.Column1, Table2.Column1 FROM Table1 INNER JOIN Table2 ON Table1.Id = Table2.Id Others: SELECT DISTINCT Column1 from TableName SELECT COUNT(*) from TableName see this discussion for others. http://db.grussell.org/imp.html Data Access Layer Databases come with interactive SQL query interfaces. Programmatic interfaces are provided by the DB vendors as well as third party vendors. There are many different standards and conventions such as ODBC, ADO, ADO.NET, JDBC, OLE DB (to see a list of Microsoft’s for example, see http://msdn2.microsoft.com/enus/library/ms810810.aspx. Data Access Layer Typical application sequence: – Connect to database Requires editable string with database connection info; usually stored in configuration file. Requires exception handling in case connection is unsuccessful! – Execute SQL – Iterate through results (if query) Display/output data – Close connection Data Access Layers ADO.NET is commonly used in C#: SQLServerConnection Conn = new SQLServerConnection("host=nc-star;port=1433; User ID=test01;Password=test01; Database Name=Test"); try { Conn.Open(); } catch (SQLServerException ex) { Console.WriteLine(ex.Message); return; } try { string strSQL = "SELECT ename FROM emp WHERE sal>50000"; SQLServerCommand DBCmd = new SQLServerCommand(strSQL, Conn); SQLServerDataReader myDataReader; myDataReader = DBCmd.ExecuteReader(); while (myDataReader.Read()) { Console.WriteLine("High salaries: " + myDataReader["ename"].ToString()); } myDataReader.Close(); Conn.Close(); } catch (Exception ex) { Console.WriteLine(ex.Message); return; } Spatial Data Spatial data can be represented using a relational database structure. From Spatial Databases, A Tour, by Shekhar and Chawla But this is complex and inefficient, requiring multiple tables to capture topology. Usually spatial data can best be captured by a binary column that contains all of the geometry associated with a record. A table in a spatial database will usually have a single geometry column of a single type, although a generic Geometry column can support multiple geometry types. Spatial Databases Spatial databases are usually built on top of RDBMS. Spatial requirements are more involved, including referential integrity based on topological geometrical relationships, not just foreign-key constraints. OGC has a SQL spec for simple features storage very similar to the simple features spec used for geo tools: http://www.opengeospatial.org/standards/sfs PostGIS complies with OGC spec. Spatial Databases Sample join with spatial columns: From Spatial Databases, A Tour, by Shekhar and Chawla Spatial Databases Mechanics of a spatial query: From Spatial Databases, A Tour, by Shekhar and Chawla PostGIS PostGIS is the spatial add-on for PostgreSQL Compares well to others but is free! (www.postgresql.org). Based on relational structure with support for spatial data types and spatial functions. Can be used to serve up map layers. Can also be used for typical database transactional functions. No intrinsic support for raster data, unlike ArcSDE or Oracle Spatial. PostGIS RDBMS use database tables and columns to represent the database structure itself. PostGIS supports GIS data by adding additional metadata tables: – spatial_ref_sys – contains a definition for each available coordinate system and projection. These can be added or edited as needed. – geometry_columns – contains a definition of each geometry column in the current database. http://www.opengis.org/techno/interop/EPSG2WKT.TXT http://www.opengis.org/techno/specs.htm PostGIS PostGIS geometry is based on the Open Geospatial Consortium simple features specification: http://www.opengeospatial.org/standards/sfs PostGIS Uses human-readable well-known text format to represent geometry (converts to binary internally): From “Introduction to Spatial Data Management with Postgis,” by Arnulf Christl http://www.ccgis.de, http://www.mapbender.org/ PostGIS Sample PostGIS SQL - note special role of AddGeometryColumn, which manages geometry_columns table. From “Introduction to Spatial Data Management with Postgis,” by Arnulf Christl http://www.ccgis.de, http://www.mapbender.org/ PostGIS Examples of supported functions – Comparisons: – ST_Equals, ST_Disjoint, ST_Intersects, ST_Touches, ST_Crosses, ST_Within, ST_Contains, and ST_Overlaps. Set operations: – ST_Intersection, ST_Difference, ST_Union, ST_SymDifference, ST_Buffer, and ST_ConvexHull Others: – AsText, GeometryFromText, Transform (re-project). PostGIS Non-spatial data can be loaded from the SQL interface using the Copy command. Spatial data can be loaded using a command-line utility that comes with the database, or a GUI shape file loader. From “Introduction to Spatial Data Management with Postgis,” by Arnulf Christl http://www.ccgis.de, http://www.mapbender.org/ PostGIS Data can be exported from PostGIS using the query interface. Spatial data can be exported to a shape file using the pgsql2shp command-line utility. The basic syntax is: psql2shp [<options>] <database> <SQL query> See http://www.bostongis.com/postgis_quickguide_1_4.bqg PostGIS PGAdmin is easy-to-use admin interface for Postgresql database. It includes a pane for DDL and an interactive SQL query tool. PostGIS Sample application: NYC Solar Map – Vector data and raster tile metadata stored in PostGIS database – Raster tiles stored on file system – Data accessed managed by C# web services – Web services communicate with web application via JSON (Javascript object notation), with GeoJSON extensions. – Client application uses Javascript to display map layers and query C# web service