GIS Data Administration 30th Edition (Fall 2011)
Data management is a primary consideration when developing enterprise GIS architectures. Enterprise GIS normally benefits from efforts to consolidate agency data resources. There are several reasons for supporting data consolidation. These reasons include improving user access to data resources, providing better data protection, and enhancing the quality of the data. Consolidation of IT support resources also reduces hardware cost and the overall cost of system administration.
The simplest and most cost-effective way to manage data resources is to keep one copy of the data in a central data repository and provide user access for data maintenance and operational GIS query and analysis needs. This is not always practical, and many business operations require that organizations maintain distributed copies of the data. Significant compromises may have to be made to support a distributed data architecture.
This section provides an overview of GIS data management technology patterns. Several basic data management tasks will be identified along with the current state of technology to support these tasks. These data management tasks include ways to manage, serve, move, store, protect, and back-up spatial data.
Management of GIS data resources is slightly different for spatial vector and raster imagery data sources. Imagery is fully integrated into GIS with the ArcGIS 10 release. Vector data is managed as features within the geodatabase and imagery is managed as a Mosaic Dataset. Both are managed within the ArcGIS Desktop ArcCatalog application.
- 1 GIS Spatial Data Architecture Patterns
- 2 Ways to Manage and Access Spatial Data
- 3 Distributed Geodatabase
- 4 Distributed Data Architecture Strategies
- 5 GIS Raster Imagery Data Architecture
- 6 ArcGIS Imagery Access Patterns
- 7 Enterprise GIS Data Management
- 8 Ways to Store Spatial Data
- 9 Ways to Protect Spatial Data
- 10 Ways to Move Spatial Data
- 11 Ways to Back Up Spatial Data
- 12 Data Management Overview
- 13 Previous Editions
GIS Spatial Data Architecture Patterns
Figure 4-1 provides an overview of the GIS spatial data architecture patterns. GIS spatial data include points, polygons, lines and their associated attributes. Data is stored in geodatabase feature tables, where each row in the table represents a spatial feature and its associated attributes. Complex features and the parcel fabric introduced with ArcGIS 10 are special extensions of these same features.
Lidar point elevation and terrain data are currently stored and managed in a geodatabase. Future plans are to manage these data as raster datasets with the ArcCatalog imagery management tools.
GIS Spatial Data Management
The ArcSDE Geodatabase was designed to manage GIS spatial data. The ArcSDE schema provides a multi-versioned data maintenance environment which can include dependencies and relationships to maintain quality and integrity of features entered into the geodatabase. Shape files, CAD files, and location/elevation data are a few examples of data that can be managed in an ArcSDE Geodatabase.
Versioned ArcSDE Geodatabase servers provide the central GIS data repository for many enterprise GIS implementations. These maintenance databases can maintain thousands of concurrent edit sessions (work in progress) in many different states of completion.
GIS Spatial Data Distribution Database
Many large organizations have used geodatabase replication to establish and maintain a separate read only distribution or publishing database that contains only the published data layers in a simple feature format. In many enterprise environments, most workflows require read only access to the spatial data. The simple feature structure improves display performance and server capacity for optimum access by enterprise business users.
ArcGIS Server map caching services make it feasible to create and maintain an optimized map cache of the more static vector basemap layers. This pyramid map cache can be accessed as a preprocessed basemap layer, with the more dynamic operational layers served from the distribution database. GIS client applications will combine the operational business layers with the basemap in the map display. Preprocessed map cache provides the fastest access and the most scalable GIS data source.
ArcGIS Server provides full, partial, and on-demand caching services. ArcGIS 10 provides a compact map cache format that reduces storage volume by up to 90 percent and improves disk access performance. ArcGIS 10 also supports mixed mode cache tile formats, which allow most of the imagery to be stored in the highest compressed JPEG format while providing transparent PNG24 format boundary tiles. Mixed mode cache tile formats are important when supporting incremental cache updates to the map cache repository.
Historical records of GIS features are becoming popular as temporal GIS workflow analysis becomes more important as a business information resource. ArcSDE provides a history geodatabase for maintaining time-stamped records of geodatabase state changes over time. As feature changes are made to the maintenance database, a copy of the feature is replicated to the history database instance as a time-stamped record of that state. Temporal views of the history dataset can be accessed based on the time stamp history.
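The idea of a temporal view can be illustrated with a small sketch. This is a toy model in plain Python, not an ArcSDE interface: the `FeatureHistory` class, its method names, and the sample parcel records are all hypothetical, and it assumes records arrive in chronological order.

```python
from bisect import bisect_right
from datetime import datetime


class FeatureHistory:
    """Toy time-stamped history for one feature (hypothetical, not an ArcSDE API)."""

    def __init__(self):
        self._stamps = []  # time stamps, kept in ascending order
        self._states = []  # attribute snapshots, parallel to _stamps

    def record(self, stamp, state):
        # Assumption: records are appended in chronological order.
        self._stamps.append(stamp)
        self._states.append(dict(state))

    def as_of(self, stamp):
        """Return the state that was current at `stamp`, or None if none existed yet."""
        i = bisect_right(self._stamps, stamp)
        return self._states[i - 1] if i else None


history = FeatureHistory()
history.record(datetime(2010, 1, 15), {"parcel": "A-12", "owner": "Smith"})
history.record(datetime(2011, 6, 1), {"parcel": "A-12", "owner": "Jones"})
```

Calling `history.as_of(...)` with a date between the two records returns the earlier snapshot, which is the behavior a temporal view based on time stamps would be expected to show.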
Ways to Manage and Access Spatial Data
Release of ArcGIS technology introduced the ArcSDE geodatabase, which provides a way to manage long transaction edit sessions within a single database instance. ArcSDE supports long transactions using versions (different views) of the database. A geodatabase can support thousands of concurrent versions of the data within a single database instance. The default version represents the real world, and other named versions are proposed changes and database updates in work.
Figure 4-2 shows a typical long transaction workflow life cycle. The workflow represents design and construction of a typical housing subdivision. Several design alternatives might initially be represented as separate named versions in the database to support planning for a new subdivision. One of these designs (versions) is approved to support the construction phase. After the construction phase is complete, the selected design (version) is modified to represent the as-built environment. Once development is completed, the final design version will be reconciled with the geodatabase and posted to the default version to reflect the new subdivision changes.
The simplest way to introduce the versioning concept in the geodatabase is by using some logical flow diagrams. Figure 4-3 demonstrates the explicit state model represented in the geodatabase. The default version lineage is represented in the center of the diagram, and a new default version state is added each time edits are posted to the default view. Each edit post represents a state change in the default view (accepted changes to the real-world view). There can be thousands of database changes (versions) at a time. As changes are completed, these versions are posted to the default lineage.
The new version on the top of the diagram shows the life cycle of a long transaction. The transaction begins as changes from "state 1" of the default lineage. Maintenance updates reflected in that version are represented by new states in the edit session (1a, 1b, and 1c). During the edit session, the default version accepts new changes from other completed versions. The new version active edit session is not aware of the posted changes to the default lineage (2, 3, 4, and 5) since it is referenced from default state 1. Once the new version is complete, it must be reconciled with the default lineage. The reconcile process compares the changes in the new version (1a, 1b, and 1c) with changes in the default lineage (2, 3, 4, and 5) to make sure there are no edit conflicts. If the reconcile process identifies conflicts, these conflicts must be resolved before the new version can be posted to the default lineage. Once all conflicts are resolved, the new version is posted to the default lineage forming state 6.
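The reconcile and post sequence described above can be sketched as a toy model. The classes below are hypothetical and greatly simplified (they are not part of any Esri API): each default state is a dictionary of edited feature IDs, and a conflict is assumed to mean that the same feature was edited in both the version and the default lineage after the version's parent state.

```python
class Geodatabase:
    """Toy default lineage: the list index is the state number."""

    def __init__(self):
        self.lineage = [{}]  # state 0 holds no edits

    def post_edits(self, edits):
        """Append a new default state and return its number."""
        self.lineage.append(dict(edits))
        return len(self.lineage) - 1


class Version:
    """Toy named version referenced from the current default state."""

    def __init__(self, gdb):
        self.gdb = gdb
        self.parent_state = len(gdb.lineage) - 1
        self.edits = {}

    def edit(self, feature_id, value):
        self.edits[feature_id] = value

    def reconcile(self):
        """List features edited both here and in default since the parent state."""
        changed_in_default = set()
        for state in self.gdb.lineage[self.parent_state + 1:]:
            changed_in_default.update(state)
        return sorted(changed_in_default & set(self.edits))

    def post(self):
        conflicts = self.reconcile()
        if conflicts:
            raise ValueError(f"resolve conflicts first: {conflicts}")
        new_state = self.gdb.post_edits(self.edits)
        self.parent_state, self.edits = new_state, {}
        return new_state


gdb = Geodatabase()
gdb.post_edits({"lot-1": "surveyed"})        # default state 1
design = Version(gdb)                        # referenced from state 1
rival = Version(gdb)                         # also referenced from state 1
for fid in ("lot-7", "lot-8", "lot-9"):      # edit states 1a, 1b, 1c
    design.edit(fid, "proposed")
rival.edit("road-2", "widened")
for n in range(2, 6):                        # other versions post states 2..5
    gdb.post_edits({f"road-{n}": "paved"})
conflicts = rival.reconcile()                # road-2 was edited in both places
new_state = design.post()                    # no conflicts, forms state 6
```

In this sketch the `design` version posts cleanly as state 6, while the `rival` version would have to resolve its conflict on `road-2` before it could be posted.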
Figure 4-4 shows a typical workflow history of the default lineage. Named versions (t1, t4, and t7) represent edit transactions in work that have not been posted back to the default lineage. The parent states of these versions (1, 4, and 7) are locked in the default lineage to support the long edit sessions that have not been posted. The default lineage includes several states (2, 3, 5, and 6) that were created by posting completed changes.
Figure 4-5 demonstrates a geodatabase compress. Very long default lineages (thousands of states) can impact database performance. The geodatabase compress function consolidates all default changes into the named version parent states, thus decreasing the length of the default lineage and improving database performance.
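A minimal sketch of the compress idea follows, assuming the lineage is a simple ordered list of states and that only the parent states of outstanding versions must survive. The `compress` function and the sample data are hypothetical illustrations, not the actual geodatabase implementation.

```python
def compress(lineage, locked):
    """Fold default states into the next locked (version parent) state.

    lineage: ordered list of (state_id, edits) pairs, where edits is a dict.
    locked:  set of state ids that outstanding versions still reference.
    """
    compressed = []
    pending = {}
    for state_id, edits in lineage:
        pending.update(edits)  # later edits override earlier ones
        if state_id in locked:
            compressed.append((state_id, pending))
            pending = {}
    if pending:
        # Trailing unlocked states collapse into one state with the last id.
        compressed.append((lineage[-1][0], pending))
    return compressed


# Seven default states; versions t1, t4, and t7 lock states 1, 4, and 7.
lineage = [(n, {f"f{n}": n}) for n in range(1, 8)]
result = compress(lineage, {1, 4, 7})
```

Here the seven-state lineage shrinks to three states (1, 4, and 7), with the edits from states 2 and 3 carried by state 4 and those from states 5 and 6 carried by state 7.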
Now that the geodatabase versioning concept is understood, it is helpful to recognize how this is physically implemented within the database table structure. The GIS spatial and attribute data are stored in relational database tables. When a feature table within the geodatabase is versioned, two new tables are created to track changes to the base feature table. An Adds Table is created to track additional rows added to the base feature table, and a Deletes Table is created to record deleted rows from the Base Table. Each row in the Adds and Deletes tables represents a change state within the geodatabase. As changes are posted to the default version, these changes are represented by pointers in the Adds and Deletes tables. Once there is a versioned geodatabase, the real-world view (default version) is represented by the Base Table plus the Adds and Deletes tables included in the default lineage (the Base Table alone does not represent default). Figure 4-6 provides a representation of the Base Table, Adds Table, and Deletes Table in a versioned geodatabase.
Most operational versioned geodatabases have a Base Table plus Adds and Deletes table values as part of the default lineage. All outstanding versions must be reconciled and posted to compress all default changes back to the Base Table (zero state). This is not likely to occur for a working maintenance database in a real-world environment.
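How the default view is assembled from the three tables can be sketched in a few lines. This is a simplified illustration with hypothetical table layouts (real Adds and Deletes tables carry more columns), and it assumes an update is stored as a delete plus an add in the same state.

```python
def resolve_view(base_rows, adds, deletes, lineage):
    """Build the rows visible in a version from the base, adds, and deletes tables.

    base_rows: {row_id: attributes} for the Base Table.
    adds:      list of (state_id, row_id, attributes).
    deletes:   list of (state_id, row_id).
    lineage:   ordered state ids that belong to the version being viewed.
    """
    rows = dict(base_rows)
    for state in lineage:
        for state_id, row_id in deletes:
            if state_id == state:
                rows.pop(row_id, None)
        for state_id, row_id, attributes in adds:
            if state_id == state:
                rows[row_id] = attributes
    return rows


base = {1: "parcel A", 2: "parcel B"}
adds = [
    (1, 3, "parcel C"),                # new row posted in state 1
    (2, 2, "parcel B (resurveyed)"),   # update: new copy of row 2 in state 2
    (9, 4, "draft parcel"),            # state 9 belongs to an unposted version
]
deletes = [(2, 2), (3, 1)]             # old row 2 retired; row 1 deleted
default = resolve_view(base, adds, deletes, lineage=[1, 2, 3])
```

The resulting default view contains the resurveyed parcel B and parcel C; the draft parcel is excluded because state 9 is not part of the default lineage.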
The ArcGIS technology includes a spatial database engine (ArcSDE) for managing and sharing GIS data. The ArcSDE and User schema define the geodatabase table structure, relationships, and dependencies. Figure 4-7 provides an overview of the ArcSDE components.
Every Esri software product includes an ArcSDE communications client. The ArcSDE schema includes relationships and dependencies used to manage geodatabase versioning and replication functionality. The ArcSDE schema also includes the geodatabase license code stored in host DBMS tables. ArcSDE also includes an executable that translates communications between ArcGIS ArcObjects and the supported DBMS. The ArcSDE executable is included in the ArcGIS ArcObjects DBMS direct connect application program interface (API), and is also available for install on the DBMS server or middle server tier as a separate application executable (GSRVR).
Geodatabase Evolution. ArcSDE has evolved from an initial binary schema and spatial storage types to the current XML schema with SQL spatial storage types. Figure 4-8 shows this evolution, which has improved spatial data access, enhanced performance and scalability, and expanded the collection of supported spatial storage data types.
ArcSDE manages the versioning schema of the geodatabase and supports client application access to the appropriate views of the geodatabase. ArcSDE also supports export and import of data from and to the appropriate database tables and maintains the geodatabase schema defining relationships and dependencies between the various tables.
ArcGIS Geodatabase Transition
Moving subsets of a single database cannot normally be supported with standard backup strategies. Data must be extracted from the primary database and imported into the remote database to support the data transfer. Database transition can be supported using standard ArcGIS export/import functions. These tools can be used as a method of establishing and maintaining a copy of the database at a separate location. Figure 4-9 identifies ways to move spatial data using ArcGIS data transition functions.
ArcSDE Admin Commands: Batch processes can be used with ArcSDE admin commands to support export and import of an ArcSDE database. Moving data using these commands is most practical when completely replacing the data layers. These commands are not optimum solutions when transferring data to a complex ArcSDE geodatabase environment.
ArcCatalog/ArcTools Commands: ArcCatalog supports migration of data between ArcSDE geodatabase environments, extracts from a personal geodatabase, and imports from a personal geodatabase to an ArcSDE environment.
Geodatabase Single-Generation Replication
The ArcGIS 8.3 release introduced a disconnected editing solution. This solution provides a registered geodatabase version extract to a personal geodatabase or separate database instance for disconnected editing purposes. The version adds/deletes values are collected by the disconnected editor and, on reconnecting to the parent server, can be uploaded to the central ArcSDE database as a version update.
Figure 4-10 presents an overview of the ArcGIS 8.3 disconnected editing with checkout to a personal geodatabase (PGD). The ArcGIS 8.3 release is restricted to a single checkout/check-in transaction for each client edit session.
Figure 4-11 presents an overview of the ArcGIS 8.3 disconnected editing with checkout to a separate ArcSDE geodatabase. The ArcGIS 8.3 release is restricted to a single checkout/check-in transaction for each child ArcSDE database. The child ArcSDE database can support multiple disconnected or local version edit sessions during the checkout period. All child versions must be reconciled before check-in with the parent ArcSDE database (any outstanding child versions will be lost during the child ArcSDE database check-in process).
Geodatabase One-way Multi-generation Replication
The ArcGIS 9.2 software introduced support for incremental updates between ArcSDE geodatabase environments.
Geodatabase Two-way Multi-generation Replication
The ArcGIS disconnected editing functionality was expanded with the ArcGIS 9 releases to support loosely coupled ArcSDE distributed database environments. Figure 4-13 presents an overview of the loosely coupled ArcSDE distributed database concept.
Multi-generation replication supports a single ArcSDE geodatabase distributed over multiple platform environments. The child checkout versions of the parent database support an unlimited number of update transactions without losing local version edits or requiring a new checkout. Updates are passed between parent and child database environments through simple datagrams that can be transmitted over standard WAN communications. This new geodatabase architecture supports distributed database environments over multiple sites connected by limited bandwidth communications (only the reconciled changes are transmitted between sites to support database synchronization).
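The point that only changes are transmitted between sites can be sketched with a generation counter. This is a toy model of one direction of the exchange; the `Replica` class and `synchronize` function are hypothetical and do not represent the actual replication message format.

```python
class Replica:
    """Toy replica that logs each edit with an increasing generation number."""

    def __init__(self):
        self.rows = {}
        self.log = []        # (generation, row_id, value)
        self.generation = 0

    def edit(self, row_id, value):
        self.generation += 1
        self.rows[row_id] = value
        self.log.append((self.generation, row_id, value))

    def changes_since(self, generation):
        return [change for change in self.log if change[0] > generation]


def synchronize(source, target, last_seen):
    """Apply only the changes the target has not yet received.

    Returns the new high-water mark and the number of changes sent.
    """
    delta = source.changes_since(last_seen)
    for _, row_id, value in delta:
        target.rows[row_id] = value  # applied without logging, so no echo back
    return source.generation, len(delta)


parent, child = Replica(), Replica()
for row_id in ("road-1", "road-2", "road-3"):
    parent.edit(row_id, "paved")
last_seen, sent_first = synchronize(parent, child, last_seen=0)
parent.edit("road-2", "resurfaced")
last_seen, sent_second = synchronize(parent, child, last_seen)
```

The first synchronization sends three changes and the second sends one, regardless of how large the full dataset is, which is why this pattern suits sites connected by limited bandwidth.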
Figure 4-14 provides an overview of common ArcGIS Server geodatabase use case scenarios.
Regional Offices can be supported by two-way multi-generation replication synchronizing with the central Data Center corporate SDE Geodatabase server. The central server maintains the land base layers, and the remote offices update the operational layers.
Mobile users can work in the field, receiving project updates and synchronizing with the central Enterprise geodatabase.
Federated hierarchical data exchange provides incremental updates from local, to state, to federal Geodatabase levels, filtered as appropriate for each level of access.
Distribution (Publication) geodatabase environments can be incrementally updated from the Maintenance (Production) database for read only access by large communities of users.
Geodatabase replication establishes a framework for a broad variety of distributed Geodatabase operations.
Distributed Data Architecture Strategies
Geodatabase replication is becoming more important as enterprise organizations expand and manage their data across multiple data centers. Figure 4-15 shows an Enterprise Data Center supporting a variety of remote site clients (stand-alone ArcGIS Desktop, CAD clients, Citrix terminal clients, and mobile GIS viewers). ArcGIS Server can be used to replicate Enterprise GIS data resources from the maintenance database to remote publishing databases maintained at a separate Data Center or in a published cloud computing environment.
Figure 4-16 shows how replication services are leveraged to support a Federated architecture. Regional Data Centers can host maintenance databases from multiple municipal organizations reducing overall administrative costs for the region. Each municipality can publish to their database of record (distribution database) for sharing with the community. Subsets (filtered versions) of the different Municipal publication databases can be integrated at a National or Global level and then published again as a National dataset. Any of the server levels can be hosted by private Data Centers, private cloud hosting providers, or the final copy published on a public cloud hosting facility. All of this is made possible with ArcGIS Server Geodatabase replication services.
GIS Raster Imagery Data Architecture
Figure 4-17 provides an overview of the GIS image data architecture patterns. GIS image data include Aerial Photography and Satellite Imagery delivered in a variety of storage formats (TIFF, IMG, MrSID, JPEG 2000, etc.). Data is stored in its delivered source format. This is important, since every time you manipulate or change the imagery format you lose quality. Imagery can be quite large (often measured in hundreds of Gigabytes, Terabytes, or even Petabytes of data). There is a real advantage in moving the data "as is" directly to your storage environment.
Lidar point elevation and terrain data are currently stored in a geodatabase and managed with the vector data. Future plans are to manage these data as raster datasets with ArcCatalog imagery management tools.
GIS Raster Data Management
The Mosaic Dataset was designed to manage GIS raster data. A Mosaic Dataset is created using ArcGIS Desktop ArcCatalog and provides on-the-fly processing of the raw imagery data sources. The Mosaic Dataset is created within a host geodatabase.
GIS Raster Data Access
Imagery can be accessed from ArcGIS Desktop or through ArcGIS Server Image Services. ArcGIS Desktop clients have full access to imagery through the Mosaic Dataset. The ArcGIS Server Image Service can access single image catalogs directly. ArcGIS Server Image Extension enables image service access to Imagery through a published Mosaic Dataset.
ArcGIS Server map caching makes it feasible to create and maintain an optimized map cache of the Imagery layers. The imagery map cache is an optimized tiled layer including a pyramid of standard map scales. Imagery map cache can be accessed as seamless preprocessed map tiles, brought together as an Imagery basemap. GIS client application overlays dynamic business layers (spatial data) over the Imagery basemap in the map display. Preprocessed map cache image pyramids provide the fastest access and the most scalable GIS data source.
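To give a feel for how quickly a cache pyramid grows, the sketch below counts tiles per level. It assumes a quadtree tiling scheme in which each level doubles the resolution in both directions (the arrangement commonly used by web map tile services); the function names are hypothetical, and an actual cache may use a different set of scales or cover only part of the extent.

```python
def tiles_at_level(level):
    """Tiles covering the full extent at one level of a quadtree pyramid."""
    return 4 ** level


def total_tiles(max_level):
    """Tiles in a complete pyramid from level 0 through max_level."""
    return sum(tiles_at_level(level) for level in range(max_level + 1))


small_pyramid = total_tiles(3)    # 1 + 4 + 16 + 64
deepest_share = tiles_at_level(10) / total_tiles(10)
```

Under this assumption roughly three quarters of all tiles sit in the most detailed level, which is one reason partial and on-demand caching of the large scales can be attractive.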
ArcGIS Server provides Full, Partial, and on demand caching services. ArcGIS 10 includes a compact map cache format that reduces storage volume by up to 90 percent and improves disk access. ArcGIS 10 supports mixed mode cache tile formats which allow most of the imagery to be stored in the highest compressed JPEG format with transparent PNG24 format boundary tiles. Mixed mode cache tile formats are important for incremental cache updates to the Imagery map cache repository.
Historical Imagery Online Management
Historical records of GIS features are becoming popular as temporal GIS workflow analysis becomes more important as a business information resource. Imagery metadata formats include time-stamped images. Storage vendors provide highly scalable content storage solutions that can manage access and protection of file data sources over long periods of time. These storage solutions are combined with hierarchical storage management, moving inactive data files to lower cost media based on usage requirements. Hierarchical storage management in conjunction with content storage technology is an evolving solution for online temporal access to unlimited capacity Imagery archives.
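A hierarchical storage management policy can be thought of as a rule that maps idle time to a storage tier. The sketch below is illustrative only: the tier names and the 90-day and 365-day thresholds are hypothetical values, not vendor defaults.

```python
from datetime import date

# Hypothetical policy: (maximum idle days, tier name), checked in order.
TIERS = [(90, "online disk"), (365, "nearline disk")]
ARCHIVE_TIER = "archive media"


def storage_tier(last_access, today):
    """Pick a storage tier for a file based on days since last access."""
    idle_days = (today - last_access).days
    for max_idle, tier in TIERS:
        if idle_days <= max_idle:
            return tier
    return ARCHIVE_TIER


today = date(2011, 10, 1)
recent = storage_tier(date(2011, 9, 1), today)    # idle 30 days
aging = storage_tier(date(2011, 1, 1), today)     # idle 273 days
dormant = storage_tier(date(2009, 1, 1), today)   # idle more than a year
```

A real policy engine would also weigh file size, access frequency, and retention rules; the single idle-time test here is only meant to show the shape of the decision.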
ArcGIS Imagery Access Patterns
Once Imagery is available on a local network file share, ArcGIS Desktop can be used to author a Mosaic Dataset for multiple image data sources or establish an Image Catalog for a single Image data source. Imagery can be published and accessed by ArcGIS Desktop clients or through ArcGIS Server Image Services.
What is a Mosaic Dataset?
Figure 4-18 provides an overview of the Mosaic Dataset. The Mosaic Dataset is a catalog of Imagery and rasters, associated metadata, and processing functions for managing access to online raster data sources. The Mosaic Dataset is stored in a geodatabase and authored using ArcGIS Desktop. The processing functions enable dynamic mosaicking and on-the-fly imagery processing.
Direct Image Access
Figure 4-19 shows ArcGIS Desktop access to read/write to individual Image files using a traditional workstation. Direct access is available for a variety of Image formats, including TIF, IMG, and MrSID. ArcGIS Desktop can also create a Mosaic Dataset of local Image resources for access and management of the available Imagery inventory.
ArcGIS Server Image Service
Figure 4-20 shows options available for accessing Imagery through ArcGIS Server Image Service. The ArcGIS Server Image Service provides direct access to preprocessed single Image Datasets. The ArcGIS Server Image Extension expands Image Services to include on-the-fly processing of multiple Imagery resources utilizing a published Mosaic Dataset.
Cached Static Imagery Services
Figure 4-21 shows access to a cached image service. ArcGIS Desktop or ArcGIS Server can be used to create and maintain an Imagery Cache. Imagery cache is read only – what you cache is what you get. Imagery is preprocessed and tiled for high performance. This is the same format used by Google and Bing maps – with world-wide coverage of Bing Maps available online through ArcGIS.com.
Recommended Imagery Workflow
Managing your imagery resources is becoming increasingly important for most organizations. People expect to see Imagery as a background option in their map display; it is how we relate to our world. Figure 4-22 shows what we recommend for managing your Imagery inventory.
Imagery is provided by Satellite or aerial photography suppliers. Once Imagery is loaded on your local network, use ArcGIS Desktop to author a Mosaic Dataset of your imagery. The Mosaic Dataset provides local access to your imagery. ArcGIS Desktop clients can access the Mosaic Dataset and leverage on-the-fly processing of multiple image data sources. ArcGIS Server can provide image services from a preprocessed Imagery dataset.
ArcGIS Server Image Extension can leverage the mosaic dataset for dynamic on-the-fly processing of multiple imagery data sources, providing a full range of image services to local desktop and distributed Web clients. ArcGIS Server can also be used to create a map cache basemap for serving the larger Web community.
When receiving imagery updates, use ArcGIS Desktop to register updates with the Mosaic Dataset and ArcGIS Server to update the Image map cache basemap.
Enterprise GIS Data Management
Enterprise GIS Data often includes a mixture of vector business layers, more stable reference land base layers, and imagery. Figure 4-23 provides a composite overview of enterprise GIS data management, including management options for map features (business and land base layers) and raster data (dynamic and static). Historical feature and imagery data is also becoming increasingly important in supporting business analysis needs. ArcGIS provides a full range of tools for managing your data for optimum use by your organization.
Ways to Store Spatial Data
Storage technology has evolved over the past 20 years to improve data access and provide better management of available storage resources. Understanding the advantages of each technical solution will help you select the storage architecture that best supports your needs.
Evolution of Storage Area Networks
Figure 4-24 provides an overview of the evolution of traditional storage from internal workstation disk to the storage area network architecture.
Internal Disk Storage. The most elementary storage architecture puts the storage disk on the local machine. Most computer hardware today includes internal disk for use as the storage medium. Workstations and servers can both be configured with internal disk storage. The fact that access to it is through the local workstation or server can be a significant limitation in a shared server environment: if the server operating system goes down, there is no way for other systems to access the internal data resources.
File server storage provides a network share that can be accessed by many client applications within the local network. Disk mounting protocols (NFS and CIFS) provide local application access over the network to the data on the file server platform. Query processing is provided by the application client, which can involve a high amount of chatty communications between the client and server network connection.
Database server storage provides query processing on the server platform, which significantly reduces the required chatty network communications. Database software also improves data management by providing better administration and control of the data.
Internal storage can include RAID mirror disk volumes that will preserve the data store in the event of a single disk failure. Many servers include storage trays that provide multiple disk drives for configuring RAID 5 configurations and facilitate high capacity storage needs. Access to internal storage is limited to the host server. As data center environments grew larger in the 1990s, customers would have many servers with too much disk (disk not being used) and other servers with too little disk, making disk volume management a challenge (data volumes could not be shared between server internal storage volumes). External storage architecture (Direct Attached, Storage Area Networks, and Network Attached Storage) provides a way for organizations to “break out” from these “silo based” storage solutions and build a more manageable and adaptive storage architecture.
Direct Attached Storage. A direct attached storage (DAS) architecture provides the storage disk on an external storage array platform. Host bus adaptors (HBA) connect the server operating system to the external storage controller using the same block level protocols that were used for Internal Disk Storage, so from an application perspective the direct attached storage appears and functions the same as internal storage. The external storage arrays can be designed with fully redundant components (the system would continue operations with any single component failure), so a single storage array product can satisfy high-availability storage requirements.
Direct attached storage technology would often provide several fiber channel connections between the storage controller and the server HBAs. For high availability purposes, it is standard practice to configure two HBA fiber channel connections for each server environment. Typical Direct Attached Storage solutions would provide from 4 to 8 fiber channel connections, so you can easily connect up to 4 servers each with redundant fiber channel connections from a single direct connect storage array controller. Multiple disk storage volumes are configured and assigned to each specific host server, and the host servers would have full access control to the assigned storage volumes. In a server failover scenario, the primary server disk volumes can be reassigned to the failover server.
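The connection arithmetic in the paragraph above is simple enough to state as a function. The function name is hypothetical, and the default of two HBA connections per server reflects the high-availability practice described in the text.

```python
def max_redundant_servers(controller_ports, hbas_per_server=2):
    """Servers one storage controller can attach, given ports used per server."""
    if hbas_per_server < 1:
        raise ValueError("each server needs at least one HBA connection")
    return controller_ports // hbas_per_server


small_array = max_redundant_servers(4)   # 4 ports, 2 per server
large_array = max_redundant_servers(8)   # 8 ports, 2 per server
```

With 8 controller ports and redundant connections this yields the 4 servers mentioned above; growing beyond that count is where a switched storage network becomes useful.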
Storage Area Networks. The difference between direct attached storage and a storage area network is the introduction of a Fiber Channel Switch to establish network connectivity between multiple Servers and multiple external Storage Arrays. The storage area network (SAN) improves administrative flexibility for assigning and managing storage resources when you have a growing number of server environments. The Server HBAs and the External Storage Array controllers are connected to the Fiber Channel Switch, so any Server can be assigned storage volumes from any Storage Array located in the storage farm (connected through the same storage network). Storage protocols are still the same as with Direct Attached or Internal Storage – so from a software perspective, these storage architecture solutions appear the same and are transparent to the application and data interface.
Evolution of Network Attached Storage
Network Attached Storage. By the late 1990s, many data centers were using servers to provide client application access to shared file data sources. High-availability environments would require complicated failover clustered file servers, so that if one of the file servers failed users would still have access to the file share. Hardware vendors decided to provide a hybrid appliance configuration to handle these network file shares (called Network Attached Storage or NAS). The network attached storage incorporates a file server and storage in a single consolidated high-availability storage platform. The file server can be configured with a modified operating system that provides both NFS and CIFS disk mount protocols, and a storage array with this modified file server network interface is deployed as a simple network attached storage appliance. The storage appliance includes a standard Network Interface Card (NIC) interface to the local area network, and client applications can connect to the storage appliance file shares over standard disk mount protocols. The network attached storage provided a very simple way to deploy a high capacity network file share for access by a large number of UNIX and/or Windows network clients. Figure 4-25 shows the evolution of the Network Attached Storage architecture.
Network attached storage provides a very effective architecture alternative for supporting network file shares, and has become very popular among many GIS customers. When GIS data migrated from early file based data stores (coverages, LIBRARIAN, ArcStorm, Shapefiles) to a more database centric data management environment (ArcSDE Geodatabase servers), the network attached storage vendors suggested customers could use a network file share to support database server storage. There were some limitations. It is important to assign dedicated data storage volumes controlled by each host database server to avoid data corruption. Database query performance was also slower due to chatty IP disk mount protocols, and bandwidth over the IP network was lower than in Fiber Channel switch environments (1 Gbps IP networks vs 2 Gbps Fiber Channel networks). As a result, Network Attached Storage was not an optimum alternative to Storage Area Networks for geodatabase server environments. Network attached storage was an optimum architecture for file based data sources, and use of the NAS technology alternative continued to grow.
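The bandwidth difference quoted above can be put in rough terms with a best-case transfer calculation. This sketch ignores protocol overhead, latency, and disk speed, so real transfers would take longer; the function name and the 100 GB data volume are illustrative assumptions.

```python
def transfer_seconds(data_bytes, link_gbps):
    """Best-case time to move data over a link, ignoring all overhead."""
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)


dataset_bytes = 100e9                               # hypothetical 100 GB data volume
ip_time = transfer_seconds(dataset_bytes, 1)        # 1 Gbps IP network
fc_time = transfer_seconds(dataset_bytes, 2)        # 2 Gbps Fiber Channel network
```

Even in this idealized case the 1 Gbps link takes twice as long, and the chatty disk mount protocols mentioned above would widen the gap for database workloads.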
Because of the simple nature of network attached storage solutions, you can use a standard local area network (LAN) Switch to provide a network to connect your servers and storage solutions; this is a big selling point for the NAS proponents. There is quite a bit of competition between Storage Area Networks and Network Attached Storage technology, particularly when supporting the more common database environments. The SAN community will claim their architecture has higher bandwidth connections and uses standard storage block protocols. The NAS community will claim they can support your storage network using standard LAN communication protocols and provide support for both database server and network file access clients from the same storage solution.
The network attached storage community eventually introduced a more efficient iSCSI communication protocol for IP storage networks (SCSI storage protocols over IP networks). GIS architectures today include a growing number of file data sources (examples include ArcGIS Server Image Extension, ArcGIS Server pre-processed 2-D and 3-D map cache, and file geodatabase). For many GIS operations, a blend of these storage technologies (SAN and NAS) provides the optimum storage solution. Introduction of iSCSI over an IP switched storage network architecture provides an attractive compromise for mixed DBMS/File Share storage requirements.
Ways to Protect Spatial Data
Enterprise GIS environments depend heavily on GIS data to support a variety of critical business processes. Data is one of the most valuable resources of a GIS, and protecting data is fundamental to supporting critical business operations.
The primary data protection line of defense is provided by the storage solutions. Most storage vendors have standardized on redundant array of independent disks (RAID) storage solutions for data protection. A brief overview of basic storage protection alternatives includes the following:
Just a Bunch of Disks (JBOD): A disk volume with no RAID protection is referred to as a just a bunch of disks (JBOD) configuration. This represents a configuration of disks with no protection and no performance optimization.
RAID 0: A disk volume in a RAID 0 configuration provides striping of data across several disks in the storage array. Striping supports parallel disk controller access to data across several disks reducing the time required to locate and transfer the requested data. Data is transferred to array cache once it is found on each disk. RAID 0 striping provides optimum data access performance with no data protection. One hundred percent of the disk volume is available for data storage.
RAID 1: A disk volume in a RAID 1 configuration provides mirror copies of the data on disk pairs within the array. If one disk in a pair fails, data can be accessed from the remaining disk copy. The failed disk can be replaced and data restored automatically from the mirror copy without bringing the storage array down for maintenance. RAID 1 provides optimum data protection with minimum performance gain. Available data storage is limited to 50 percent of the total disk volume, since a mirror disk copy is maintained for every data disk in the array.
RAID 3 and 4: A disk volume in a RAID 3 or RAID 4 configuration supports striping of data across all disks in the array except for one parity disk. A parity bit is calculated for each data stripe and stored on the parity disk. If one of the disks fails, the parity bit can be used to recalculate and restore the missing data. RAID 3 and RAID 4 provide good protection of the data and allow optimum use of the storage volume. All disks except the single parity disk can be used for data storage, optimizing use of the available disk volume for data storage capacity.
There are some technical differences between RAID 3 and RAID 4, which, for our purposes, are beyond the scope of this discussion. Both of these storage configurations have potential performance disadvantages. The common parity disk must be accessed for each write, which can result in disk contention under heavy peak user loads. Performance may also suffer because of requirements to calculate and store the parity bit for each write. Write performance issues are normally resolved through array cache algorithms on most high-performance disk storage solutions.
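The parity recovery described above can be illustrated with a minimal sketch. It assumes the parity is a byte-wise XOR of the data blocks in a stripe, which is the usual approach for parity RAID; the function names and sample blocks are hypothetical and are not taken from any storage vendor's implementation.

```python
from functools import reduce
from operator import xor


def parity_block(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks in a stripe must be the same size")
    # zip(*blocks) yields one column of byte values per byte position.
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors and the parity block."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the contents of the missing block.
    return parity_block(list(surviving_blocks) + [parity])


stripe = [b"ROAD", b"LAKE", b"CITY"]   # data blocks on three data disks
parity = parity_block(stripe)           # block stored on the parity disk

# Simulate losing the second data disk and rebuilding its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
```

Because every write changes the parity block as well as a data block, the sketch also shows why a single dedicated parity disk can become a point of contention under heavy write loads.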
The following RAID configurations are the most commonly used to support ArcSDE storage solutions. These solutions represent RAID combinations that best support data protection and performance goals. Figure 4-26 provides an overview of the most popular composite RAID configurations.
RAID 1/0: RAID 1/0 is a composite solution including RAID 0 striping and RAID 1 mirroring. This is the optimum solution for high performance and data protection. This is also the costliest solution. Available data storage is limited to 50 percent of the total disk volume, since a mirror disk copy is maintained for every data disk in the array.
RAID 5: RAID 5 includes the striping and parity of the RAID 3 solution, but distributes the parity for each stripe across all disks in the array to avoid the parity disk contention bottleneck. This improved parity solution provides optimum disk utilization and near optimum performance, with the equivalent of only one disk volume given over to parity.
Hybrid Solutions: Some vendors provide alternative proprietary RAID strategies to enhance their storage solution. New ways to store data on disk can improve performance and protection and may simplify other data management needs. Each hybrid solution should be evaluated to determine if and how it may support specific data storage needs.
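The storage capacity trade-offs described for each RAID level can be compared with a short calculation. This is a minimal sketch: the function names, the eight-disk array, and the 2 TB disk size are illustrative assumptions, and proprietary hybrid solutions are not modeled because their overhead varies by vendor.

```python
def usable_fraction(level: str, disks: int) -> float:
    """Fraction of raw capacity available for data, per the descriptions above."""
    if level in ("JBOD", "RAID 0"):
        return 1.0                      # no protection overhead
    if level in ("RAID 1", "RAID 1/0"):
        return 0.5                      # every data disk has a mirror copy
    if level in ("RAID 3", "RAID 4", "RAID 5"):
        if disks < 3:
            # Assumption: parity arrays are built from at least three disks.
            raise ValueError("parity RAID sketch assumes at least 3 disks")
        return (disks - 1) / disks      # one disk's worth of parity
    raise ValueError(f"unknown RAID level: {level}")


def usable_terabytes(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity of an array of identical disks."""
    return usable_fraction(level, disks) * disks * disk_tb


# Hypothetical array: eight 2 TB disks.
raid5_tb = usable_terabytes("RAID 5", 8, 2.0)
raid10_tb = usable_terabytes("RAID 1/0", 8, 2.0)
```

With these assumed values, the parity configuration leaves noticeably more usable space than the mirrored configuration, which is the cost difference noted for RAID 1/0.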
ArcSDE data storage strategies depend on the selected database environment.
SQL Server: Log files are located on RAID 1 mirror, and index and data tables are located on RAID 5 disk volume.
Oracle, Informix, and DB2: Index tables and log files are located on RAID 1/0 mirrored and striped data volumes, and data tables are located on RAID 5.
Potential Storage Contention Concerns
Technology is improving display performance and moving more data faster and more efficiently throughout the server, network, and storage infrastructure. For GIS, computer processing speed has traditionally limited how fast we use our data resources. As processing speed and data traffic increase, GIS data access can place new demands on storage subsystems. Figure 4-27 highlights some potential storage performance concerns and identifies some available solutions to address these concerns.
Most storage solutions today use mechanical arms to access data stored on a rotating disk platter. Disk rotation speed has stayed about the same for a long time (over 10 years). Storage capacity has increased within the same disk volume as media density improves while the average disk input/output (I/O) access speed for these mechanical drives has not seen much improvement.
Disk contention is more likely today due to faster display processing times, higher capacity servers, and larger peak concurrent GIS user loads. Larger disk volumes also mean that data is spread across fewer disks than was the case with smaller drives. Other factors, such as disk fragmentation caused by rapid data update cycles and larger volumes of storage, reduce efficiency in accessing data and increase the potential for data contention (I/O backups).
GIS map cache data sources remove the need for server processing, further increasing server capacity and the volume of requests for cached data resources. We have preprocessed a significant portion of the traditional GIS data resources; access to preprocessed data requires very little processing.
There are some characteristics about data caching that help avoid disk contention. When a cached file is requested by a client application, the file content will be cached on its way to the client display. If the same file is requested again, it can be delivered from cache and avoid additional requests to the disk level. Inline caching reduces disk contention risk when there are large demands for exactly the same file – the file can be delivered from distributed cached sources without returning to the original disk. This makes it difficult to identify just when disk contention might become a problem – it all depends on the types of user workflows accessing the data on the same disk and how the data might be cached in the system.
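The effect of inline caching on disk requests can be sketched with a small least-recently-used cache. This is an illustrative model only: the TileCache class, the tile paths, and the fake disk reader are hypothetical, and real caching layers (file system, Web server, edge appliance) are considerably more involved.

```python
from collections import OrderedDict


class TileCache:
    """Tiny least-recently-used cache in front of a slow disk read."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self._read = read_from_disk
        self._items = OrderedDict()
        self.hits = 0
        self.disk_reads = 0

    def get(self, path):
        if path in self._items:
            self._items.move_to_end(path)    # mark as most recently used
            self.hits += 1
            return self._items[path]
        data = self._read(path)              # cache miss: go to disk
        self.disk_reads += 1
        self._items[path] = data
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return data


def fake_disk(path):
    """Stand-in for a real disk read (hypothetical)."""
    return f"contents of {path}".encode()


# Workflow 1: many requests for exactly the same cached tile.
popular = TileCache(capacity=2, read_from_disk=fake_disk)
for _ in range(100):
    popular.get("L05/R0012/C0034.png")

# Workflow 2: requests cycle through more tiles than the cache can hold.
scattered = TileCache(capacity=2, read_from_disk=fake_disk)
for _ in range(2):
    for tile in ("a.png", "b.png", "c.png"):
        scattered.get(tile)
```

In the first workflow the disk is touched once; in the second every request reaches the disk. The same storage can therefore see very different loads depending on how user workflows line up with what is cached.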
The good news is that technical solutions are available when we do experience disk contention, and it is quite simple to monitor disk I/O performance and identify whether disk contention is a problem. A variety of vendor solutions accelerate Web access by caching data files on edge servers or Web accelerator appliances (servers located near the site communication access points). Disk access can also be improved with RAID data striping; distributing the data across larger RAID volumes can reduce the probability of disk contention. Solid state disk drives are also available in the current marketplace, with solutions that deliver data over 1000 times faster than current mechanical disk drives. Cost for solid state drives is currently much higher than their mechanical counterparts, although this can change as solid state market sales volume increases (vendors are waiting for an opportunity to upgrade our storage solutions).
So what should we do to minimize disk contention risk? Be aware of the potential for performance bottlenecks within your storage environment. Monitor disk I/O performance to identify when disk contention is causing a performance delay. Be aware that there are solutions to disk I/O performance problems, and take appropriate action to address performance issues when they occur.
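As a rough illustration of what monitoring disk I/O can involve, the sketch below derives throughput and utilization from two readings of cumulative disk counters. The DiskSample fields, the sample values, and the 80 percent warning threshold are all assumptions made for the example; actual counters and sensible thresholds depend on the operating system and storage platform.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    seconds: float        # time the sample was taken
    ios_completed: int    # cumulative I/O operations completed
    busy_seconds: float   # cumulative time the disk had work outstanding


def summarize(prev: DiskSample, curr: DiskSample):
    """Return (I/O operations per second, utilization) between two samples."""
    interval = curr.seconds - prev.seconds
    if interval <= 0:
        raise ValueError("samples must be taken at increasing times")
    iops = (curr.ios_completed - prev.ios_completed) / interval
    utilization = (curr.busy_seconds - prev.busy_seconds) / interval
    return iops, utilization


def contention_suspected(utilization: float, threshold: float = 0.8) -> bool:
    """Flag a disk that is busy most of the time (threshold is an assumption)."""
    return utilization >= threshold


# Hypothetical readings taken ten seconds apart.
before = DiskSample(seconds=0.0, ios_completed=1000, busy_seconds=10.0)
after = DiskSample(seconds=10.0, ios_completed=2500, busy_seconds=19.0)
iops, utilization = summarize(before, after)
```

A disk that stays near full utilization during peak periods while display response times degrade is a reasonable candidate for the remedies listed above.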
Ways to Move Spatial Data
Many enterprise GIS solutions require continued maintenance of distributed copies of the GIS data resources, typically replicated from a central GIS data repository or enterprise database environment. Organizations with a single enterprise database solution still have a need to protect data resources in the event of an emergency such as fire, flood, accidents, or other natural disasters. Many organizations have recently reviewed their business practices and updated their plans for business continuance in the event of a major loss of data resources. The tragic events of September 11, 2001, demonstrated the value of such plans and increased interest and awareness of the need for this type of protection.
This section reviews the various ways organizations move spatial data. Traditional methods copy data on tape or disk and physically deliver this data to the remote site through standard transportation modes. Once at the remote site, data is reinstalled on the remote server environment. Technology has evolved to provide more efficient alternatives for maintaining distributed data sources. Understanding the available options and risks involved in moving data is important in defining optimum enterprise GIS architecture.
Traditional Data Transfer Methods
Figure 4-28 identifies traditional methods for moving a copy of data to a remote location.
Traditional methods include backup and recovery of data using standard tape or disk transfer media. Moving data using these methods is commonly called "sneaker net." These methods provide a way to transfer data without the support of a physical network.
Tape Backup: Tape backup solutions can be used to move data to a separate server environment. Tape transfers are normally very slow. The reduced cost of disk storage has made disk copy a much more feasible option.
Disk Copy: A replicated copy of the database on disk storage can support rapid restore at a separate site. The database can be restarted with the new data copy and online within a short recovery period.
Customers have experienced a variety of technical challenges when configuring DBMS spatial data replication solutions. ArcSDE data model modifications may be required to support DBMS replication solutions. Edit loads will be applied to both server environments, contributing to potential performance or server sizing impacts. Data changes must be transmitted over network connections between the two servers, causing potential communication bottlenecks. These challenges must be overcome to support a successful DBMS replication solution. Customers have indicated that DBMS replication solutions can work, but they require considerable patience and carry implementation risk. Acceptable solutions are available through some DBMS vendors to support replication to a read-only backup database server. Dual master server configurations significantly increase the complexity of an already complex replication solution. Figure 4-29 presents the different ways to move spatial data using database replication.
Synchronous Replication. Real-time replication requires commitment of data transfer to the replicated server before releasing the client application on the primary server. Edit operations with this configuration would normally result in performance delays because of the typical heavy volume of spatial data transfers and the required client interaction times. High-bandwidth fiber connectivity (1000 Mbps bandwidth) is recommended between the primary server and the replicated backup server to minimize performance delays.
Asynchronous Replication. Near real-time database replication strategies decouple the primary server from the data transfer transaction to the secondary server environment. Asynchronous replication can be supported over WAN connections, since the slow transmission times are isolated from primary server performance. Data transfers (updates) can be delayed to off-peak periods if WAN bandwidth limitations dictate, supporting periodic updates of the secondary server environment at a frequency supporting operational requirements.
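The difference between the two replication modes can be seen in a simple estimate of how long an editor waits for a commit. This is a back-of-the-envelope sketch: the edit size, local commit time, link speeds, and round-trip times are illustrative assumptions, and protocol overhead is ignored.

```python
def transfer_seconds(megabytes: float, link_mbps: float) -> float:
    """Time to move a payload over a link rated in megabits per second."""
    return megabytes * 8 / link_mbps


def client_wait_seconds(edit_mb, local_commit_s, link_mbps, round_trip_s,
                        synchronous):
    """Time the client application is held before the edit is released."""
    if synchronous:
        # The primary waits for the replica to confirm before releasing.
        return local_commit_s + transfer_seconds(edit_mb, link_mbps) + round_trip_s
    # Asynchronous: the transfer happens later, outside the client's wait.
    return local_commit_s


EDIT_MB = 5.0          # assumed size of one spatial edit transaction
LOCAL_COMMIT_S = 0.05  # assumed local commit time

sync_fiber = client_wait_seconds(EDIT_MB, LOCAL_COMMIT_S, 1000.0, 0.001, True)
sync_wan = client_wait_seconds(EDIT_MB, LOCAL_COMMIT_S, 1.5, 0.080, True)
async_wan = client_wait_seconds(EDIT_MB, LOCAL_COMMIT_S, 1.5, 0.080, False)
```

Under these assumptions the synchronous wait over a 1000 Mbps connection stays under a tenth of a second, the same edit over a 1.5 Mbps WAN link would hold the client for tens of seconds, and the asynchronous client is released after the local commit alone.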
Disk-level replication is a well-established technology, supporting global replication of critical data for many types of industry solutions. Spatial data is stored on disk sectors very similar to any other data storage and, as such, does not require special attention beyond what might be required for other data types. Disk volume configurations (data location on disk and what volumes are transferred to the remote site) may be critical to ensure database integrity. Mirror copies are refreshed based on point-in-time snapshot functions supported by the storage vendor solution. Disk-level replication provides transfer of block-level data changes on disk to a mirror disk volume located at a remote location. Transfer can be supported with active online transactions with minimum impact on DBMS server performance capacity. Secondary DBMS applications must be restarted to refresh the DBMS cache and processing environment to the point in time of the replicated disk volume. Figure 4-30 presents the different ways to move spatial data using disk level replication.
Synchronous Replication: Real-time replication requires commitment of data transfer to the replicated storage array before releasing the DBMS application on the primary server. High-bandwidth fiber connectivity (1000 Mbps bandwidth) is recommended between the primary server and the replicated backup server to avoid performance delays.
Asynchronous Replication: Near real-time disk-level replication strategies decouple the primary disk array from the commit transaction of changes to the secondary storage array environment. Asynchronous replication can be supported over WAN connections, since the slow transmission times are isolated from primary DBMS server performance. Disk block changes can be stored and data transfers delayed to off-peak periods if WAN bandwidth limitations dictate, supporting periodic updates of the secondary disk storage volumes to meet operational requirements.
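Storing disk block changes and transferring them later can be sketched as follows. This is a toy model under stated assumptions: the AsyncBlockReplicator class and its methods are hypothetical, volumes are represented as dictionaries of block number to contents, and the write-ordering guarantees that real storage arrays enforce are not modeled.

```python
class AsyncBlockReplicator:
    """Tracks changed blocks on a primary volume and ships them on demand."""

    def __init__(self):
        self.primary = {}     # block number -> contents at the primary site
        self.secondary = {}   # block number -> contents at the remote site
        self._dirty = {}      # changed blocks not yet sent to the remote site

    def write(self, block_no, data):
        """Apply a write locally and remember the block for later transfer."""
        self.primary[block_no] = data
        self._dirty[block_no] = data   # repeated writes keep only the latest

    def pending_blocks(self):
        return len(self._dirty)

    def flush(self):
        """Send pending block changes, for example during off-peak hours."""
        sent = len(self._dirty)
        self.secondary.update(self._dirty)
        self._dirty.clear()
        return sent


replicator = AsyncBlockReplicator()
for version in (b"v1", b"v2", b"v3"):
    replicator.write(7, version)       # same block rewritten three times
replicator.write(9, b"v1")

pending = replicator.pending_blocks()  # blocks waiting for the WAN transfer
sent = replicator.flush()
```

Four writes were made but only two blocks cross the link, because a block rewritten several times before the transfer is sent once in its latest state. Delaying transfers therefore trades currency of the remote copy for reduced WAN traffic.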
Ways to Back Up Spatial Data
Data protection at the disk level minimizes the need for system recovery in the event of a single disk failure but will not protect against a variety of other data failure scenarios. It is always important to keep a current backup copy of critical data resources, and maintain a recent copy at a safe location away from the primary site. Figure 4-31 highlights data backup strategies available to protect your business operations.
The type of backup system you choose for your business will depend on your business needs. For simple low priority single use environments, you can create a periodic point-in-time backup on a local disk or tape drive and maintain a recent off-site copy of your data for business recovery. For larger enterprise operations, system availability requirements may drive requirements for failover to backup Data Centers when the primary site fails. Your business needs will drive the level of protection you need.
Data backups provide the last line of defense for protecting our data investments. Careful planning and attention to storage backup procedures are important factors to a successful backup strategy. Data loss can result from many types of situations, with some of the most probable situations being administrative or user error.
Host Tape Backup: Traditional server backup solutions use lower-cost tape storage for backup. Data must be converted to a tape storage format and stored on a linear tape medium. Backups can be a long, drawn-out process that takes considerable server processing resources (typically consuming a CPU during the backup process) and requires special data management for operational environments.
For database environments, point-in-time backups are required to maintain database continuity. Database software provides for online backup requirements by enabling a procedural snapshot of the database. A copy of the protected snapshot data is retained in a snapshot table when changes are made to the database, supporting point-in-time backup of the database and potential database recovery back to the time of the snapshot.
Host processors can be used to support backup operations during off-peak hours. If backups are required during peak-use periods, backups can impact server performance.
Network Client Tape Backup: The traditional online backup can often be supported over the LAN with the primary batch backup process running on a separate client platform. DBMS snapshots may still be used to support point-in-time backups for online database environments. Client backup processes can contribute to potential network performance bottlenecks between the server and the client machine because of the high data transfer rates during the backup process.
Storage Area Network Client Tape Backup: Some backup solutions support direct disk storage access without impacting the host DBMS server environment. Storage backup is performed over the SAN or through a separate storage network access to the disk array with batch process running on a separate client platform. A disk-level storage array snapshot is used to support point-in-time backups for online database environments. Host platform processing loads and LAN performance bottlenecks can be avoided with disk-level backup solutions.
Disk Copy Backup: The size of databases has increased dramatically in recent years, growing from tens of gigabytes to hundreds of gigabytes and, in many cases, terabytes of data. Recovery of large databases from tape backups is very slow, taking days to recover large spatial database environments. At the same time, the cost of disk storage has decreased dramatically providing disk copy solutions for large database environments competitive in price to tape storage solutions. A copy of the database on local disk, or a copy of these disks to a remote recovery site, can support immediate restart of the DBMS following a storage failure by simply restarting the DBMS with the backup disk copy.
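The point-in-time snapshots used by the online backup options above can be illustrated with a copy-on-write sketch: the original value of a row is preserved the first time it changes after the snapshot begins. The SnapshotTable class and the parcel rows are hypothetical, and the model is far simpler than what a DBMS or storage array actually does.

```python
_MISSING = object()   # marks rows that did not exist when the snapshot began


class SnapshotTable:
    """Toy table that keeps original values for rows changed during a backup."""

    def __init__(self, rows):
        self.live = dict(rows)
        self._preserved = None   # None means no snapshot is active

    def begin_snapshot(self):
        self._preserved = {}

    def update(self, key, value):
        # Copy on write: save the original value once, on the first change.
        if self._preserved is not None and key not in self._preserved:
            self._preserved[key] = self.live.get(key, _MISSING)
        self.live[key] = value

    def snapshot_view(self):
        """Rows exactly as they were when the snapshot began."""
        if self._preserved is None:
            raise RuntimeError("no snapshot is active")
        view = dict(self.live)
        for key, original in self._preserved.items():
            if original is _MISSING:
                view.pop(key, None)      # row was added after the snapshot
            else:
                view[key] = original     # restore the pre-change value
        return view


table = SnapshotTable({"parcel_1": "v1", "parcel_2": "v1"})
table.begin_snapshot()
table.update("parcel_1", "v2")       # edits continue during the backup
table.update("parcel_1", "v3")
table.update("parcel_3", "new")
backup_rows = table.snapshot_view()
```

The backup process reads a consistent view while editors keep working against the live rows, which is what allows an online database to be backed up without being taken out of service.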
There are several names for disk backup strategies (remote backup, disaster recovery plan, business continuance plan, continuation of business operations plan, etc.). The important thing is that you consider your business needs, evaluate risks associated with loss of data resources, and establish a formal plan for business recovery in the event of data loss.
Data Management Overview
Support for distributed database solutions has traditionally introduced high-risk operations, with potential for data corruption and use of stale data sources in GIS operations. There are organizations that support successful distributed solutions. Their success is based on careful planning and detailed attention to their administrative processes that support the distributed data sites. More successful GIS implementations support central consolidated database environments with effective remote user performance and support. Future distributed database management solutions may significantly reduce the risk of supporting distributed environments. Whether centralized or distributed, the success of enterprise GIS solutions will depend heavily on the administrative team that keeps the system operational and provides an architecture solution that supports user access needs.
The next chapter on performance fundamentals will focus on understanding the technology, presenting the fundamental terms and relationships used in system architecture design capacity planning.