# 20 Minutes to Understanding Spatial Database

posted 3 years ago in Dev Platform category by Ki Sun Song

Spatial RDBMS is an RDBMS that can process spatial data. Popular RDBMSs, such as Oracle, offer their own Spatial RDBMS features or add-ons so that spatial data can be processed.

Since each DBMS has a different architecture, it is difficult to show how it operates through a simple diagram. But we can explain at least the concept of a spatial DBMS through the following diagram.

Spatial RDBMS allows to use SQL data types, such as *int* and *varchar*, as well as spatial data types, such as *Point*, *Linestring* and *Polygon* for geometric calculations like distance or relationships between shapes.

RDBMS uses the **B-Tree** series or **Hash Function** to process indexes (see CUBRID Query Tuning Techniques for more explanation), basically to determine the size or redundancies of column values. In other words, only one-dimensional data can be processed. Since spatial data types are two or three dimensional:

**R-Tree**series or**Quad Trees**can be used in Spatial RDBMS to process such data;- Or it is necessary to transform two or three dimensional data to one dimensional data, then
**B-Tree**can be used.

Many benefits follow if the existing RDBMS is extended to process spatial data. First, even when conducting geo-spatial tasks, there will be many occasions when basic data types, such as numbers or characters, are used. Another benefit is that there will not be a burden of additional training, since SQL is already a verified solution which can successfully store the data.

RDBMS is not the only database management system available. Likewise, spatial RDBMS is not the only spatial database management system available. Many databases, such as MongoDB, the document-oriented database, search engines such as Lucene or Solr, provide spatial data processing features. However, these solutions offer less features and do not provide high precision calculations. To understand what high precision calculations mean, we will take a closer look at the features a spatial DBMS provides.

## OpenGIS

OpenGIS is a standard solution to process spatial data. The OGC (Open Geospatial Consortium), a consortium made up of 416 governmental organizations, research centers and companies from all over the world (as of 2011), legislates this standard. OpenGIS (Open Geodata Interpretability Specification) is a registered trademark of the OGC, and is a standard for geospatial data processing (*this document does not differentiate spatial and geospatial*).

Out of the many standards in the OpenGIS, the one standard needed to understand the spatial DBMSS is Simple Feature.

### Simple Feature

As mentioned above, Simple Feature is a standard needed to process the geospatial data. **Geometry Object Model** (spatial data type), **Spatial Operation and Coordinate System** (two and three dimension) are subject to the Simple Feature standard. Geometry Object Model are figures such as *Point*, *LineString* and *Polygon*.

The Geometry Data Model (spatial data type) can exist in not only two dimensions but three dimensions as well. The area dealt by Simple Feature is **Euclid**. Therefore, spatial operation on intersects and touches are all Euclid geometry areas. In the Simple Feature Statement document, the Geometry Object Model is dealt like an Object-Oriented languages class. An actual UML is used to describe. The following is a *Geometry Object Model Class Diagram*, which provides a summary of the contents in a Simple Feature Specification. *Point*, *LineString*, and *Polygon*, all inherit this geometry. ‘**Query**’ and ‘**analysis**’ blocks are Spatial Operations.

+ dimension() : Integer + coordinateDimension() : Integer + spatialDimension() : Integer + geometry Type() : String + SRID() : Integer + envelope() : Geometry #query + equals(another :Geometry) : Boolean + disjoint(another :Geometry) : Boolean + intersects(another :Geometry) : Boolean + touches(another :Geometry) : Boolean + crosses(another :Geometry) : Boolean + within(another :Geometry) : Boolean + contains(another :Geometry) : Boolean ... #analysis + distance(another : Geometry) : Distance + buffer(another : Distance) : Geometry + convexHull() : Geometry ...

The standards document uses a formula to describe the operation.

The *Within* calculation shown in the above Figure can be explained in the following formula where **I()** function is *interior*, **E()** function is *exterior*.

(a ? b = a) ? (I(a) ? E(b)) = Ø )

Operations such as **buffer()** are used frequently. When a point unit geometry is given as an argument, then this is processed by **buffer()** and returned as a geometry that is in a form of a line that is surrounded for a certain distance. **Buffer()** for point would be a circle. If the center of a road is shown with a LineString, the **buffer()** can be used to identify the road type that can put the road width into consideration. Building near a certain road can be identified using **touches()**.

The Simple Feature described so far is the "**Simple Feature Common Architecture**". The other Simple Feature standards are Simple Feature CORBA, Simple Feature OLE/COM and Simple Feature SQL.

Naturally, Spatial RDBMS is closely related to Simple Feature SQL. Simple Feature SQL includes Simple Feature Common Architecture, and deals with the standards Spatial RDBMS must have based on ANSI SQL 92. It deals with how to show the Geometry Object Model in DBMS data type and how to show the SQL function in spatial operation, etc. It also specifies the basic DBMS Table the Spatial DBMS must have. SPATIAL_REF_SYS, a table that contains the Spatial Reference ID is a good example.

A famous open source library that implements the Simple Feature specifications is JTS Topology Suite, which is written in a Java, and GEOS, a C++ port of JTS. GEOS is used in PostGIS (package that is added on to the PostgreSQL to process Spatial data).

## Spatial Reference

Explanations or figures regarding Spatial Reference is from http://www.sharpgis.net/post/2007/05/Spatial-references2c-coordinate-systems2c-projections2c-datums2c-ellipsoids-e28093-confusing.aspx.

Spatial DBMS is most widely used in areas such as transportation (air/shipping) and astronomy. The calculation used for oil-pump construction and to set flight courses can be explained in geometric terms as calculating the length of a line that is connected from one dot to the other on the Earth’s surface.

Locations on the Earth’s surface is usually marked using the longitude and latitude.

As seen in the above figure, a one degree longitude difference differs greatly depending on the latitude. A longitude 1° from the equator is 111.321 km, but longitude 1° from latitude 60° is only 55.802 km.

A three dimensional Earth surface layout must be modified to a two dimensional map for easy calculation of width and distance. This is called **projection**. However, it’s almost impossible to show a round Earth in a two dimensional form so that it is both precise and simple.

The most famous map in the world, the Mercator Projection, is suitable for nautical purposes. The angles of the circular Earth are identically portrayed in a flat surface. Therefore, the image is larger than it actually is near the pole areas.

Plate Carrée Projection (also known as Equirectangular Projection) is better than Mercator Projection when calculating lengths or widths.As the figure above shows, the location is projected with a square grid. Therefore, the calculation of distance from north, south, east and west is extremely precise. Plate Carrée Projection is currently used as the *de facto* method in the GIS area and has been adopted by the Google Earth service.

However, we still have to find the longitude and latitude. As shown in *Geographical Coordinate System* figure, we need to know where the center of the Earth is to calculate the longitude and latitude using the meridian.

The Earth is neither globular nor oval. On top of that, the Earth surface’s position shifts due to the Earth crust’s movement. The Earth surface moves 1cm every year, even without a severe earthquake. Therefore, the center of the Earth always changes.

The difference in distance between the center of a virtual globe and the center of the ever-changing oval globe is called the **Geocentric Datums**, or just **Datums**.

Although the location is identical on the surface, the longitude and latitude values can change depending on the Datums. The Datums currently most widely used as a “standard” is the value determined by the WGS84 (World Geodetic System), which is the standard method used in GPS.

Spatial Reference is a combination of the coordination system, projection and datum which has been explained up until now. The EPSG (European Petroleum Survey Group) created an ID system for the coordination system, projection and datum.

When we say the EPSG:4326 is used as a spatial reference, we mean that the Plate Carrée Projection is used with the longitude/latitude value based on WGS84. EPSG:4326 is a spatial reference used for GPS satellites and NATO military purposes, and is most widely used in the world.

In the above *Geometry Object Model Class Diagram*, the Class Diagram contains a method **SRID()**, which returns this spatial reference ID. Distance and width calculation is done based on spatial reference. In order to conduct a spatial operation on a geometry model with different spatial reference IDs (SRID), one of the two must be changed to the other’s SRID. In a special DBMS that goes by the Simple Feature standard, this SRID is saved in the SPATIAL_REF_SYS table. There are more than 3,000 SRIDs. http://spatialreference.org/ref/?search=korea shows SRIDs of the Korean Peninsula.

## Example of Spatial SQL

The following example shows the **Nearest Neighbor search**, which can find N number of users within an M meter radius from a specific user’s location, made with a Spatial SQL.

Let’s say that the *userLocation* table contains a field called *user ID*, which uses a character string, and *location*, which uses the *point* data type to show the location. If all *location* fields in the *userLocation* table uses the same SRID, then the Nearest Neighbor search can be executed easily, as shown in the following code snippet (*this SQL example is active in PostgreSQL/PostGIS*). Functions that start with *ST_* is a spatial type and is the standard function defined in Simple Features.

SELECT SIEVE.userID, SIEVE.DISTANCE as distance FROM (SELECT USERS.userID, USERS.location , ST_Distance (SPOT.location::geography, USERS.location:: geography) as DISTANCE FROM userLocation AS USERS, (SELECT location FROM userLocation WHERE userID = #userID#) AS SPOT WHERE USERS.userID != #userID# and ST_DWithin(SPOT.location, USERS.location, #diameter# ,false ) ) AS SIEVE ORDER BY SIEVE.DISTANCE limit #howMany#

## Representative Spatial RDBMS

The most representative spatial RDBMS are, and PostgreSQL for open source RDBMS, and MS SQL Server and Oracle for commercial DBMS. Function-wise, there are three strong RDBMSs (PostgreSQL, Oracle, MS SQL Server) and one medium (MySQL). (*Test results can be found here and here.*)

MyISAM must be used as a pool when using the Spatial Extension in MySQL. While PostgreSQL/PostGIS supports more than 300 Simple Feature functions and operators, MySQL provides only a few functions that uses MBR (Minimum Bound Rectangle). Since Oracle took over MySQL, no additional developments of Spatial Extension have progressed.

The MS SQL Server provides the spatial function in the release and does not require a separate add-on. Therefore, departments that use a version above MS SQL 2008 can use the spatial function without a separate procedure.

Oracle provides two separate spatial functions; Oracle Locator and Oracle Spatial. Oracle Spatial is the higher function and does not require additional license purchase in an environment that uses Enterprise Oracle.

## Spatial Index

In the Spatial Index, the MBR (Minimum Bound Rectangle), also known as **bounding box** or **envelope**, is sought after in order to index the spatial data type. As shown in Figure 3, the Simple Feature requires all spatial data to provide a method that returns MBR.

MBR is also used when two spatial data is combined ((MINX MINY, MAXX MINY, MAXX MAXY, MINX MAXY, MINX MINY)). Equivalence/size relationships in one-dimension can be expressed in space as intersect, cross, within or touches, and MBR is to be used to effectively process the relationship. When calculating relationships (intersects, touches, etc.) on two or more spatial data, use the MBR for linear computing.

The figure above illustrates an R-Tree configuration with 4 sequences. The MBR that covers the spatial data is D to M and MBR created for indexing is A to C.

Let’s create an example where spatial data, which includes a random point, is searched. If the point is in a “within” relationship with D, G or M, then the root to leaf node can be found immediately. If the point is in the area F and J intersects, then in the root node, A will be visited first then return F, then move to root node, select C and return J. This is where the problem lies. The example in Figure 2 only has two tree levels. However, it will take a lot of time to search the tree if there is a lot of data.

These problems have been resolved in R^{+}-Tree.

In the figure above, the spatial data configuration from D to M is the same as in *R Tree configuration*. However, the A and C MBR area is different. In R^{+}-Tree, if needed, the overlapping areas in the same level can be brought from different nodes. You can see that J and D are redundant.

When such redundancies occur, when looking for the area F and J are intersecting, you need only to search in A MBR and do not have to search C.

## When Contemplating the Application of Spatial RDBMS

In order to use the location based service using Spatial RDBMS, check whether the Spatial RDBMS satisfies the performance the service needs. MySQL needs to be checked to see whether the necessary functions are provided, but PostgreSQL/PostGIS does not require an examination.

When implementing the Nearest Neighbor search with Spatial DBMS, the R-Tree index will contain numerous Point data that shows where the user location is. In order to search a user (Point) who is within an “N” meter radius of a specific location, acquire an MBR for the “N” meter (circle) radius and search the point the MBR is “within” from the root node. If many MBRs in the leaf node overlap, then more time is needed for the search. (*Not all Spatial RDBMS uses R-Tree series as index. MS SQL Server uses Z-order enumeration to change spatial data to one-dimensional data, and indexes using B-Tree. IBM DB2 uses QuadTree.*)

Since RDBMS has to guarantee durability, logs are left on the disk for every operation, which places a severe load on the I/O. The RDBMS’s greatest fault is in the scalability. Spatial RDBMS is not adequate if multiple users have to process in real-time at the same time.

Key-value DB that guarantees scalability uses the *hash* function as the index, so it is vulnerable when searching ranges (*there is no key-value DB that offers spatial index yet*). Mongo DB or Lucene has greater IO performance than RDBMS (durability not guaranteed), but they use only the most basic spatial index, and the functions are poor.

If there is a limitation in the number of spatial data or operation, or already using MS SQL Server 2008, MySQL 5.1 above MyISAM, then the Spatial RDBMS would be a practical choice since there is no burden of adding additional platform infra.

By Ki-sun Song, Web Platform Development Lab, NHN Corporation.