GIS Data Administration 33rd Edition (Fall 2013)
Data provides the resources you need to make proper business decisions. The information products required to make business decisions determine the critical data resources that must be available for business operations. How you organize and maintain your data resources will contribute to system performance and user productivity.
How GIS data is managed has changed dramatically over the past 10 years. Much of this change is driven by technology. The big focus in the 1990s was to move GIS data resources together in an SDE geodatabase, where users could better manage and share enterprise data resources. Data management today includes multiple publication formats to improve display performance and capture change over time.
A variety of data management and distribution strategies are available today to improve data access and dissemination throughout the rapidly expanding GIS user community. The volume of data you must sort through each day is growing exponentially. How you manage, organize, and control these data resources is critical to your success.
- 1 GIS feature data architecture
- 2 ArcSDE Geodatabase
- 3 Geodatabase replication use-cases
- 3.1 Distributed enterprise architecture strategies
- 3.2 Mobile operations
- 3.3 Production/publication operations
- 3.4 Extract/transform/load operations
- 3.5 Distributed geodatabase operations
- 3.6 Hierarchical operations
- 4 GIS imagery data architecture
- 5 CPT Platform Capacity Calculator custom imagery services
- 6 GIS enterprise data architecture
- 7 Storage architecture options
- 8 Ways to move GIS data
- 9 Protect your GIS data resources
- 10 Data Management Overview
- 11 CPT Video: GIS data source
- 12 Previous Editions
GIS feature data architecture
Figure 5.1 shows the data source architecture patterns available to manage GIS feature data. GIS feature data includes points, polygons, lines, complex features, and associated attributes. Additional content may include parcel fabric, cartographic representations, lidar point elevation, terrain data, etc.
Geospatial data is the core integration of business intelligence. How you organize your data contributes directly to your business complexity and drives the performance of your business operations. Good data management empowers your ability to make proper business decisions.
There are only two kinds of data: useful data and useless data. Useful data is what you use to create business information products and enable informed business decisions. Useless data is what you do not use, and it can rapidly increase the complexity of your data repository.
Best practice: Data should be organized and managed to empower proper business decisions and optimize user productivity.
GIS Feature Data Production Database
A production data source is an ArcSDE geodatabase used to organize and manage your geospatial feature data resources.
- Maintains a complete collection of all critical geospatial feature data.
- Includes schema for managing and validating accuracy and integrity of spatial feature edits.
- Includes functional dependencies and relationships between feature datasets.
- Manages multi-user versioned edit operations.
- Includes all work in progress (versions) along with published datasets (DEFAULT).
Best practice: A production data source provides a single integrated repository for all enterprise-level geospatial feature data resources.
GIS Feature Data Publication Database
A publication geodatabase is an ArcSDE or file geodatabase used to optimize distribution of finalized geospatial data resources.
- Includes a read-only simple feature geodatabase.
- Provides simple feature format that improves display performance and system capacity.
- Provides separate distribution access layer improving data security.
- Provides optimum distribution format for operational geospatial layers.
Best practice: Distribute geospatial business operational layers in a publication geodatabase.
GIS Feature Data Map Cache
A feature data map cache is a collection of preprocessed tiled map images stored at multiple scales for rapid dissemination.
- Combines multiple geospatial layers into multiple levels of single read-only preprocessed map tiles.
- Configured in a pyramid tile structure at standardized projection and map scales.
- Cached format delivers map tiles with negligible processing overhead.
- Structured tile format enables client browser caching for high display performance.
- Optimum distribution format for static basemap layers.
Best practice: Distribute geospatial static basemap layers in a preprocessed map cache.
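The pyramid tile structure described above can be illustrated with a minimal sketch. It assumes a Web Mercator style quadtree scheme in which each level doubles the number of tiles along each axis; the function names and the level numbering are illustrative assumptions, not the tiling scheme of any particular ArcGIS cache.

```python
import math
from typing import Tuple


def tiles_per_axis(level: int) -> int:
    """Tiles along each axis at a pyramid level (assumed: level 0 is one tile)."""
    return 2 ** level


def total_tiles(max_level: int) -> int:
    """Total tiles in a full pyramid from level 0 through max_level."""
    return sum(tiles_per_axis(level) ** 2 for level in range(max_level + 1))


def tile_for_point(lon: float, lat: float, level: int) -> Tuple[int, int]:
    """Return (column, row) of the tile containing a lon/lat point.

    Assumes a Web Mercator quadtree with the origin at the top-left corner.
    """
    n = tiles_per_axis(level)
    col = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    row = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on or beyond the grid edge stay inside the grid.
    col = max(0, min(col, n - 1))
    row = max(0, min(row, n - 1))
    return col, row
```

Because every client requests the same tile addresses for the same location and scale, the tiles can be stored once and reused from server or browser caches.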
GIS Feature Data Archiving
Geodatabase (GDB) archiving (shown as an available component of the production data source) includes functionality to record and access changes made to all or a subset of data in a versioned geodatabase.
- Provides a mechanism for capturing, managing, and analyzing data change.
- Creates and maintains a separate feature class schema associated with the versioned geodatabase.
- When enabled, maintains all changes saved or posted to the DEFAULT version in an associated archive class.
- Enables temporal analysis of geospatial resources over time.
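The archiving behavior listed above can be sketched as a small in-memory model. This is an illustration only: the class and field names (`ArchiveClass`, `valid_from`, `valid_to`) are hypothetical and do not reproduce the actual geodatabase archive schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ArchiveRow:
    feature_id: int
    attributes: Dict[str, str]
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the row is still current


class ArchiveClass:
    """Hypothetical archive class holding every change posted to DEFAULT."""

    def __init__(self) -> None:
        self.rows: List[ArchiveRow] = []

    def record_post(self, feature_id: int,
                    attributes: Optional[Dict[str, str]],
                    when: datetime) -> None:
        """Retire the current row for the feature, then store the new one.

        Passing attributes=None models a delete.
        """
        for row in self.rows:
            if row.feature_id == feature_id and row.valid_to is None:
                row.valid_to = when
        if attributes is not None:
            self.rows.append(ArchiveRow(feature_id, dict(attributes), when))

    def as_of(self, moment: datetime) -> Dict[int, Dict[str, str]]:
        """Return the features as they existed at a historical moment."""
        return {
            row.feature_id: row.attributes
            for row in self.rows
            if row.valid_from <= moment
            and (row.valid_to is None or moment < row.valid_to)
        }
```

Querying `as_of` with different dates is the kind of temporal comparison that archiving is meant to support.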
The CPT Platform Capacity Calculator is a simple tool for evaluating selected platform capacity. The default tool, located at the bottom of the CPT Hardware tab, includes a variety of standard workflows that demonstrate platform capacity. For analysis and reporting purposes, you may want to change the default list of sample workflows and include those workflows you are evaluating in your own design environment. The Platform Capacity Calculator workflow samples can be changed to a custom set of workflows for demonstration purposes.
ArcSDE Geodatabase
The release of ArcGIS technology introduced the ArcSDE geodatabase, which provides a way to manage long transaction edit sessions within a single database instance. ArcSDE supports long transactions using versions (different views) of the database. A geodatabase can support thousands of concurrent versions of the data within a single database instance. The default version represents the real world, and other named versions are proposed changes and database updates in work.
What is versioning?
Geodatabase versioning allows multiple users to edit the same data in an ArcSDE geodatabase without applying locks or duplicating data. Figure 5.2 provides a drawing of a versioned geodatabase workflow.
Users always access an ArcSDE geodatabase through a version. When you connect to a multiuser geodatabase, you specify the version to which you will connect. By default, you connect to the DEFAULT version.
Best practice: Use a versioned geodatabase when managing multiple edit sessions of common feature datasets over time.
Geodatabase versioning example
GIS users have many use-cases in which long transaction workflows are critical. Figure 5.3 shows a long transaction workflow for developing a new community housing subdivision.
A new housing subdivision is being approved by the city.
- City submits requests for design proposals for the new subdivision.
- City planning establishes edit sessions and develops multiple subdivision proposals.
- Subdivision design proposals are provided to the city council for review and approval.
- Design is selected and approved for construction.
- New housing subdivision is constructed over a two-month period.
- City planning updates design for as-built subdivision.
- New housing subdivision is posted to DEFAULT for publishing and distribution.
ArcSDE explicit state model
Figure 5.4 shows the progress of a versioned workflow edit session over time.
The diagram shows DEFAULT version lineage and new version lineages.
- The DEFAULT version is the current "public" end-user view of the geodatabase.
- Lineage refers to the version states as they are updated. Each update provides an additional state to the version lineage.
- The edit version represents the Edit state lineage.
- A new user edit session is started from DEFAULT lineage state 1.
- User edit session saves state 1a and 1b when completing new housing subdivision design.
- Other user edit sessions are posted to DEFAULT, creating states 2, 3, and 4.
- User edit session completes the as-built design and begins the reconcile process.
- Any row conflicts (deletions, additions, or modifications by other user edit sessions) are identified during the reconcile process.
- Once conflicts are resolved, the edit session can be posted to DEFAULT, creating new state 6.
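The reconcile and post steps above can be sketched in a few lines. This is a simplified model under stated assumptions: each lineage is reduced to a dictionary of row changes made since the common ancestor state, and the function names are hypothetical rather than ArcSDE operations.

```python
from typing import Dict, Optional, Set

Change = Optional[str]  # new row value, or None for a delete


def reconcile(edit_changes: Dict[int, Change],
              default_changes: Dict[int, Change]) -> Set[int]:
    """Return row ids touched in both lineages since the common ancestor."""
    return set(edit_changes) & set(default_changes)


def post(default_rows: Dict[int, str],
         edit_changes: Dict[int, Change],
         default_changes: Dict[int, Change]) -> Dict[int, str]:
    """Apply the edit session to DEFAULT, refusing if conflicts remain."""
    conflicts = reconcile(edit_changes, default_changes)
    if conflicts:
        raise ValueError(f"resolve conflicts first: {sorted(conflicts)}")
    merged = dict(default_rows)
    for row_id, value in edit_changes.items():
        if value is None:
            merged.pop(row_id, None)  # delete
        else:
            merged[row_id] = value    # add or modify
    return merged
```

In this sketch a conflict is any row changed on both sides; the real reconcile process lets the editor review each conflict and choose which representation to keep before posting.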
ArcSDE version state tuning
Figure 5.5 shows the ArcSDE Geodatabase DEFAULT version state tree. Enterprise ArcSDE production database maintenance environments often support many GIS editors posting many changes to the geodatabase DEFAULT lineage over time. In many scenarios, the DEFAULT lineage tree can rapidly grow to hundreds and even thousands of state changes. Many of the states are redundant and can cause database tables to grow in size. Reducing the size of these tables can improve database performance and reduce maintenance overhead.
Compressing the state tree removes each state that is not currently referenced by a version and is not the parent of multiple child states. When executed by the ArcSDE administrator, all states that meet the compression criteria are removed, regardless of owner. All other users can compress only the states that they own. This operation reduces the depth of the state tree, shortening the lineage and improving query performance.
Best practice: The DEFAULT state tree should be compressed on a periodic schedule to maintain optimum query performance.
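The compression criteria can be expressed as a short sketch. It assumes the state tree is held as a child-to-parent mapping and that the root state is always kept; both are modeling assumptions made for illustration, and the function only identifies candidate states rather than moving any rows.

```python
from typing import Dict, List, Optional, Set


def compressible_states(parent_of: Dict[int, Optional[int]],
                        referenced: Set[int]) -> List[int]:
    """List states that meet the compression criteria.

    A state qualifies when no version references it and it is not the
    parent of multiple child states. The root (parent None) is kept,
    which is an assumption of this sketch.
    """
    child_count: Dict[int, int] = {}
    for parent in parent_of.values():
        if parent is not None:
            child_count[parent] = child_count.get(parent, 0) + 1
    return sorted(
        state for state, parent in parent_of.items()
        if parent is not None
        and state not in referenced
        and child_count.get(state, 0) <= 1
    )


# DEFAULT lineage 0 -> 1 -> 2 -> 3 -> 4, edit branch 1 -> 11 -> 12.
tree = {0: None, 1: 0, 2: 1, 3: 2, 4: 3, 11: 1, 12: 11}
versions = {4, 12}  # states referenced by DEFAULT and by the edit version
candidates = compressible_states(tree, versions)
```

State 1 is kept in this example because it is the parent of two branches, which is why a long-running edit version can hold older states in place until it is reconciled.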
You can also trim the state tree, which collapses a linear branch of the tree. This is done by reconciling and saving an edit lineage to a more current state (for example, version t1 could be reconciled and saved to reference state 7, freeing state 1 for compression). Trimming database states will reduce the depth of the state tree, shortening the lineage and improving query performance.
Best practice: The DEFAULT state tree may need to be trimmed to reduce the number of long transaction reference states for optimum query performance.
Versioned geodatabase view
Figure 5.6 shows the key database tables supporting a versioned geodatabase view. Several additional tables are included in a versioned geodatabase. ArcSDE uses the additional tables to manage multiple concurrent edit sessions and query access to different views of the data.
For example, a user query of the current DEFAULT view would include features (rows) in the Base table, plus any posted rows from the Adds table, minus any posted rows from the Deletes table. This same approach would be used to view open Edit versions, with ArcSDE sorting out what features were required to populate the version view.
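The "Base plus Adds minus Deletes" rule can be sketched as follows. The table layouts are simplified assumptions (tuples tagged with a state id), not the actual ArcSDE table definitions.

```python
from typing import Dict, List, Tuple


def version_view(base: Dict[int, str],
                 adds: List[Tuple[int, int, str]],
                 deletes: List[Tuple[int, int]],
                 lineage: List[int]) -> Dict[int, str]:
    """Build the rows visible to a version.

    adds:    (state id, row id, value) entries
    deletes: (state id, row id) entries
    lineage: state ids of the version, ordered oldest to newest
    """
    rows = dict(base)
    for state in lineage:
        # An update is modeled as a delete plus an add in the same state,
        # so deletes are applied before adds.
        for delete_state, row_id in deletes:
            if delete_state == state:
                rows.pop(row_id, None)
        for add_state, row_id, value in adds:
            if add_state == state:
                rows[row_id] = value
    return rows


base = {1: "parcel A", 2: "parcel B"}
adds = [(1, 3, "parcel C"), (2, 2, "parcel B (resurveyed)"), (10, 4, "proposed lot")]
deletes = [(2, 2), (3, 1)]

default_view = version_view(base, adds, deletes, [1, 2, 3])
edit_view = version_view(base, adds, deletes, [1, 10])
```

The same tables yield different results for each lineage, which is how one database instance can serve both the DEFAULT view and open edit versions.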
Versioning managed by ArcSDE schema
Figure 5.7 shows the ArcSDE geodatabase schema. The ArcSDE geodatabase includes the ArcSDE schema and the user schema. The ArcSDE geodatabase license key must be installed with the ArcSDE schema.
The ArcSDE schema is used to manage operations within the versioned database environment.
- Query views to DEFAULT or to specified open edit versions of the database.
- Reconcile and post operations related to the various versioned states.
- Geodatabase archive schema for managing state history.
- Geodatabase replication services for managing versioned updates sent to distributed geodatabase sources.
- Geodatabase licensing to protect geodatabase functional integrity.
Note: Multiple user schema instances were supported with the ArcGIS 9.2 release.
ArcSDE schema has evolved
The ArcSDE geodatabase has evolved over the past several ArcGIS software releases. Figure 5.8 shows some of the key evolution milestones.
ArcSDE geodatabase evolution:
- ArcSDE and User Data schema were stored as database binary files through the ArcGIS 9.0 release.
- Database SQL types were developed to manage user data by the ArcGIS 9.2 release.
- ArcSDE schema was upgraded to an XML format with the ArcGIS 10.0 release.
- More open access.
- Faster browsing and searching.
- Improved scalability with large numbers of datasets.
- Foundation to support larger collection of dataset types in the future.
Geodatabase replication use-cases
ArcSDE manages the versioning schema of the geodatabase and supports client application access to the appropriate views of the geodatabase. ArcSDE also supports export and import of data from and to the appropriate database tables and maintains the geodatabase schema defining relationships and dependencies between the various tables.
ArcSDE provides a variety of data replication options associated with a versioned geodatabase as shown in Figure 5.9.
Note: Replication is the process of sharing data so as to ensure consistency between redundant sources. Geodatabase replication provides filtered data sharing at the version level.
- Check-out and synchronization with mobile laptop clients
- Provisioning and synchronization with ArcGIS Windows Mobile clients
- Incremental updates to an ArcSDE geodatabase publishing database
- Incremental updates to a file geodatabase publishing environment
- Corporate production ArcSDE geodatabase synchronized with multiple remote office production ArcSDE geodatabases
- Centralized scalable ArcSDE distributed geodatabase architecture sharing a common storage area network
- Local production geodatabase sharing versions of their data for regional and national operations
- Global remote operations exchanging versions of their data with regional and corporate management operations
Distributed enterprise architecture strategies
Enterprise GIS operations often include a variety of geodatabase replication functions as shown in Figure 5.10.
The four red arrows on the chart show the use of geodatabase replication services.
- Remote sites synchronizing operations with the central enterprise production (maintenance) geodatabase.
- Mobile operations are sometimes connected (synchronized) with the central enterprise production geodatabase.
- Enterprise production geodatabase replicating to a publication database. Publication database could be maintained locally or in a cloud hosting facility.
- Remote ArcSDE geodatabase production servers can replicate to a publishing geodatabase in a cloud hosting facility.
Mobile operations
Mobile field operations are a big part of GIS workflows within most organizations.
- Many mobile operators do not have direct input to the computerized business workflow processes.
- Often information collected in the field must be entered into the system once the users return to the office.
- In many cases, there is a staff of editors who enter material from marked-up hardcopy provided by the field operators.
Best practice: Mobile operations provide a way to integrate field operators into the computerized business workflow processes, improving business efficiency and reducing effort required to get data into the computerized systems.
ArcSDE geodatabase replication support for disconnected editing operations.
Figure 5.11 shows ArcSDE geodatabase replication support for disconnected desktop SDE geodatabase client editing operations.
Note: Disconnected editing extends the geodatabase to provide clients with the capability to perform edit operations in the field when not connected to the central geodatabase.
Check-out operations were initially supported with the ArcGIS 8.3 release.
- ArcGIS editor opens edit session with production geodatabase (creates edit version).
- Check-out of work area operational layers to desktop SDE geodatabase (e.g., SQL Express).
- Check-out reference layers to mobile file geodatabase.
- Complete disconnected field edit operations.
- Check-in field edits to version on central production geodatabase on return.
ArcSDE geodatabase replication support for disconnected Workgroup SDE geodatabase editing operations.
Figure 5.12 shows ArcSDE geodatabase replication support for disconnected Workgroup SDE geodatabase server editing operations.
Check-out operations were initially supported with the ArcGIS 8.3 release.
- ArcGIS Editor opens an edit session with the production geodatabase (creates edit version).
- Check-out of work area operational layers to workgroup SDE geodatabase (e.g., SQL Express).
- Check-out reference layers to mobile file geodatabase.
- Deploy for disconnected field edit operations.
- Multiple editors can check out from workgroup SDE geodatabase for mobile field operations.
- Reconcile and post field edits in mobile workgroup SDE geodatabase.
- Check-in field edits to version on central production geodatabase on return.
Distributed Geodatabase Operations can be used to support disconnected SDE geodatabase clients with incremental synchronization capabilities.
Production/publication operations
A versioned production geodatabase performs many functions related to the editing workflows that place processing demands on the server.
- Default version queries
- Multiple edit version sessions
- Reconcile and post operations
- Data schema table dependencies and relationships to maintain data consistency
- Geodatabase history archiving
- General maintenance operations
For many GIS operations, hundreds of users throughout the organization require access to the production data source, most requiring access to the published DEFAULT version.
Best practice: Separating the publishing database from the production database provides a more scalable and secure data management environment.
Several reasons an organization may want to use a separate publication geodatabase:
- More scalable server architecture (distributed database loads).
- More secure production environment (viewer access limited to a publication data source).
- Expand data center capacity (publication database can be hosted by cloud vendor).
- Limit public access to DMZ (publication database can be located in the DMZ).
Figure 5.13 shows how ArcSDE replication is used to share a version of the ArcSDE geodatabase on a separate publication geodatabase.
- Feature-level check-out to production SDE geodatabase was supported with the ArcGIS 8.3 release.
- One-way incremental multi-generation checkout to SDE geodatabase was supported with the ArcGIS 9.2 release.
- One-way incremental multi-generation checkout to file geodatabase was supported with the ArcGIS 9.3 release.
Best practice: ArcSDE replication is the best solution when moving part of a geodatabase to a separate geodatabase instance.
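One-way incremental replication can be sketched as a generation-numbered change log. This is a conceptual model only: the class names and the idea of a single acknowledged generation per child are assumptions made for illustration, not the geodatabase replication implementation.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Tuple[int, int, Optional[str]]  # (generation, row id, value or None)


class ParentReplica:
    """Hypothetical production side: records each post as a new generation."""

    def __init__(self) -> None:
        self.generation = 0
        self.log: List[LogEntry] = []

    def post(self, changes: Dict[int, Optional[str]]) -> None:
        self.generation += 1
        for row_id, value in changes.items():
            self.log.append((self.generation, row_id, value))

    def export_since(self, acknowledged: int) -> List[LogEntry]:
        """Return only the changes the child has not yet received."""
        return [entry for entry in self.log if entry[0] > acknowledged]


class ChildReplica:
    """Hypothetical read-only publication side."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}
        self.acknowledged = 0

    def import_changes(self, changes: List[LogEntry]) -> None:
        for generation, row_id, value in changes:
            if value is None:
                self.rows.pop(row_id, None)
            else:
                self.rows[row_id] = value
            self.acknowledged = max(self.acknowledged, generation)
```

Because the child tracks the last generation it received, several generations can accumulate on the parent and still be delivered in a single synchronization.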
There are several advantages to using a separate publication geodatabase instance for sharing data to GIS viewers.
Organizations use one-way geodatabase replication for the following reasons:
- Improved performance and system scalability.
- Replicating to a simple feature (DEFAULT) read-only publication database can improve query performance and increase platform capacity.
Best practice: An SDE geodatabase should be used when live updates are required during peak viewing operations.
- Replicating to a file geodatabase can reduce DBMS processing loads and improve display performance.
Best practice: A separate file geodatabase instance should be provided for local access by each GIS server for optimum query performance.
Separating the publication database from the maintenance database improves data security.
- Limits direct user access to the production geodatabase.
- Provides web access to a separate copy of the published dataset.
- Filtered versions of the production geodatabase can be distributed to separate publication instances, based on enhanced security requirements.
Extract/transform/load operations
Figure 5.14 shows how ArcSDE geodatabase replication can be used to move data to a different geodatabase schema. Geodatabase transition is a term used to represent replication to a geodatabase with a different schema.
Best practice: Geodatabase transition is the best solution for replicating part of a geodatabase to a separate geodatabase instance with a different schema.
The ArcGIS for Desktop Data Interoperability extension is used to create a transform script to translate data between the two schemas.
Best practice: A service can be deployed on ArcGIS for Server for incremental one-way replication.
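The idea of translating rows between two schemas can be illustrated with a plain field-mapping sketch. The field names, land-use codes, and unit conversion are invented for this example; a real transform would be authored with the Data Interoperability extension rather than in code like this.

```python
from typing import Any, Callable, Dict, Tuple

FieldRule = Tuple[str, Callable[[Any], Any]]  # (target field, converter)

# Hypothetical mapping from a production schema to a publication schema.
FIELD_MAP: Dict[str, FieldRule] = {
    "PARCEL_ID": ("parcel_id", str),
    "AREA_SQFT": ("area_sq_m", lambda sqft: round(float(sqft) * 0.09290304, 2)),
    "LANDUSE": ("land_use",
                lambda code: {"R": "residential", "C": "commercial"}.get(code, "unknown")),
}


def transform(source_row: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one source row into the target schema.

    Source fields without a mapping rule are dropped.
    """
    return {
        target: convert(source_row[source])
        for source, (target, convert) in FIELD_MAP.items()
        if source in source_row
    }


published = transform(
    {"PARCEL_ID": 1042, "AREA_SQFT": "1000", "LANDUSE": "R", "EDITOR": "jsmith"}
)
```

Keeping the mapping in one table makes it easy to review which production fields reach the publication schema and which are filtered out.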
Distributed geodatabase operations
Figure 5.15 shows a distributed geodatabase configuration. ArcGIS Geodata services can be used to establish distributed geodatabase operations for scale-out database architectures.
Distributed regional office support:
- Centralized corporate geodatabase represents the parent production database.
- Versioned replica is provided to establish each regional production database.
- Regional editors work from their individual production geodatabases.
- Regional sites reconcile and post their edits before synchronizing with corporate.
- Regional updates are synchronized with the central parent corporate geodatabase.
- Corporate editors reconcile and post corporate and regional updates.
- Corporate updates are synchronized back to the regional geodatabases.
Warning: A system back-up should be completed after each synchronization. It is important that all child replicas remain synchronized with their parent versions located on the parent corporate geodatabase.
Distributed remote ArcGIS editor (disconnected geodatabase editing operations):
- Version replica provided to ArcGIS editor desktop geodatabase (SQL Express).
- Reference data layers are replicated to local file geodatabase (one-way geodatabase replication).
- ArcGIS editor can work in a disconnected mode when mobile.
- ArcGIS editor synchronizes changes with corporate when connected.
Best practice: Distributed geodatabase replication is convenient for single-user mobile editing operations.
Geodatabase replication can be used to build and support central data center high-capacity scale-out geodatabase operations:
- Parent production geodatabase established for regional production data integration.
- Versioned replica is provided to establish each regional production database.
- Regional editors' ArcGIS for Desktop applications are hosted on centralized terminal server farm.
- Regional editors work from their individual production geodatabases in the central data center.
- Regional child production databases reconcile and post their edits before synchronizing with parent geodatabase.
- Regional updates are synchronized with the parent geodatabase.
- Corporate editors reconcile and post regional version updates.
- Corporate updates are synchronized back with regional geodatabases.
Best practice: System back-up is provided after each synchronization from a common data center storage repository, to maintain consistency between parent and child database volumes.
Distributed geodatabase architecture provides a highly scalable computing infrastructure without placing high demands on any single database or server component.
Hierarchical operations
GIS operations are expanding to include national and global architecture solutions.
Figure 5.16 shows a common pattern in managing hierarchical data sharing solutions. Different versions of the data must be shared at different levels within federated organizations. Geodatabase replication enables federated geodatabase operations to automate and manage these environments.
- GIS operations often extend well beyond the boundaries of a single organization.
Heavy arrows show supporting geodatabase replication services.
- Communities maintain their local production geodatabase, or share common regional data centers for hosting their local production geodatabase.
- ArcSDE replication maintains publication geodatabase instances for sharing local data with the broader communities.
- Local publication geodatabases are replicated to a regional (or national) production geodatabase.
- ArcSDE replication creates a national-level publication geodatabase for sharing at the national level.
- National publishing geodatabase can be hosted in the cloud for optimum service adaptability.
Best practice: Multi-tier federated architecture patterns are being implemented by a variety of national federal and military agencies. These architecture patterns are also supporting a variety of international corporations.
GIS imagery data architecture
Figure 5.17 shows the data source architecture patterns available to manage imagery data. Imagery data sources include aerial photography, elevation data, and satellite imagery.
Imagery is becoming the most valuable real-time business intelligence. Aerial photographs and satellite images can be collected in real time during a regional emergency and used to evaluate and respond to national disasters in a timely way. The volume of imagery is growing exponentially as technology for data collection and storage is rapidly evolving to leverage these digital information products.
Best practice: Rapid imagery collection and publication timelines are key to proper coordination and response to natural and man-made national disasters.
Imagery can provide valuable information when information products are examined over time. Views of the global ice caps and effects on ground cover can show information for managing climate change. Community development and agriculture changes can benefit from national imagery datasets showing changes in communities and farm products on a global scale.
An imagery file share is used to store the raw imagery files.
- Imagery can be stored and delivered when needed, maintaining optimum source quality.
- Single preprocessed images can be distributed using the ArcGIS for Server image service (without the Image extension).
- Multiple raw imagery files can be distributed using a mosaic dataset with the ArcGIS for Server Image extension.
Imagery is a primary resource for visualizing live global assets.
- The mosaic dataset provides the core functionality to organize and manage your imagery data resources.
- ArcGIS for Server provides Web client access to imagery data sources through a standard REST API.
- The mosaic dataset with the Image Extension enables on-the-fly processing, including multi-image mosaicking as it is compiled through the ArcGIS for Server image service.
- An imagery service cache is a collection of preprocessed tiled map images stored at multiple scales for rapid dissemination.
- Imagery historical archiving provides online access to imagery resources showing change over time. Hardware vendor content-addressable storage (CAS) solutions provide protected long term storage for imagery file repositories.
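Web client access through the REST API can be sketched by building an image request URL. The host and service name below are placeholders, and the request uses the image service `exportImage` operation with `bbox`, `size`, `format`, and `f` parameters as understood from the ArcGIS REST API; verify the parameters against the documentation for your server version before relying on them.

```python
from typing import Sequence
from urllib.parse import urlencode


def export_image_url(service_url: str,
                     bbox: Sequence[float],
                     size: Sequence[int],
                     image_format: str = "jpgpng") -> str:
    """Build an exportImage request URL for an image service.

    bbox is (xmin, ymin, xmax, ymax); size is (width, height) in pixels.
    """
    params = {
        "bbox": ",".join(str(value) for value in bbox),
        "size": f"{size[0]},{size[1]}",
        "format": image_format,
        "f": "image",  # ask for the image itself rather than a JSON description
    }
    return f"{service_url}/exportImage?{urlencode(params)}"


# Placeholder service URL, for illustration only.
SERVICE = "https://gis.example.com/arcgis/rest/services/Imagery/ImageServer"
url = export_image_url(SERVICE, (-120, 30, -110, 40), (400, 400))
```

The sketch only constructs the URL; fetching the image would require a reachable server and any authentication it enforces.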
What is a mosaic dataset?
Figure 5.18 shows a mosaic dataset. The mosaic dataset was developed to manage and deploy imagery information products in a time-sensitive workflow environment.
A mosaic dataset consists of:
- A catalog that provides the source of the pixels and footprints of the rasters.
- A feature class that defines the boundary.
- A set of mosaicking rules that are used to dynamically mosaic the rasters.
- A set of properties used to control the mosaicking and any image extraction.
- A table for logging during data loading and other operations.
- Optionally, a seam line feature class for seam line mosaicking.
- Optionally, a color correction table that defines the color mapping for each raster in the raster catalog.
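The relationship between the catalog, the boundary, and the mosaicking rules can be sketched with a small model. The class name, the bounding-box footprints, and the "most recent first" rule are simplifying assumptions; real footprints are polygons and the mosaic dataset supports several mosaic methods.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # xmin, ymin, xmax, ymax


@dataclass
class CatalogItem:
    path: str        # source of the pixels
    footprint: BBox  # simplified footprint of the raster
    acquired: date


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two bounding boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def select_rasters(catalog: List[CatalogItem],
                   view: BBox,
                   boundary: BBox) -> List[CatalogItem]:
    """Pick the rasters to mosaic for a view, most recent first.

    Nothing is returned when the view falls outside the boundary.
    """
    if not overlaps(view, boundary):
        return []
    hits = [item for item in catalog if overlaps(item.footprint, view)]
    return sorted(hits, key=lambda item: item.acquired, reverse=True)


catalog = [
    CatalogItem("scene_2011.tif", (0, 0, 10, 10), date(2011, 5, 1)),
    CatalogItem("scene_2013.tif", (5, 5, 15, 15), date(2013, 8, 1)),
    CatalogItem("scene_2012.tif", (20, 20, 30, 30), date(2012, 3, 1)),
]
selected = select_rasters(catalog, (6, 6, 9, 9), (0, 0, 30, 30))
```

Because the selection happens at request time, new imagery added to the catalog becomes available without rebuilding a single large mosaic.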
ArcGIS image access patterns
Figures 5.19 through 5.22 show a variety of available ArcGIS imagery deployment patterns. Imagery deployment patterns include ArcGIS for Desktop direct access to imagery, ArcGIS for Server image service access to single preprocessed imagery raster datasets, ArcGIS for Server with Image Extension license access to multiple imagery files with on-the-fly processing, and direct access to preprocessed imagery cache tiles.
ArcGIS for Desktop direct access to imagery
ArcGIS for Desktop provides direct access to imagery resources. ArcGIS for Desktop is used to create a mosaic dataset and perform imagery analysis. A mosaic dataset can be used as a common search engine for organizing and accessing local imagery resources. Creating, editing, or working with imagery is a core ArcGIS for Desktop capability and does not require an ArcGIS Image extension.
ArcGIS for Server image service access to imagery
ArcGIS for Server capabilities include Image Service access to pre-processed imagery. Imagery must be preprocessed and mosaicked as a raster dataset for direct access by ArcGIS for Server Image Service (without the Image Extension). An imagery raster dataset (providing single image access) can be stored in a file system or loaded into a geodatabase.
Warning: Imagery raster datasets tend to be quite large (can occupy several terabytes). Copying large volumes of raster data into a geodatabase challenges even the most experienced database administrators.
ArcGIS for Server Image Extension access to imagery
The ArcGIS for Server Image Extension enables Image Service access to multiple imagery files. The ArcGIS Image Extension is a license added to ArcGIS for Server, which extends the capability of serving raster data. Specifically, it allows you to use the mosaic dataset and on-the-fly imagery processing through the ArcGIS for Server Image service.
Best practice: With the ArcGIS Image extension, you can serve an imagery repository through a mosaic dataset or raster dataset layers.
The ArcGIS for Server Image Extension gives you the ability to:
- Put your valuable imagery to use quickly.
- Serve collections of imagery or lidar data as image services.
- Dynamically create and serve mosaics from the original imagery, without the need to pre-compute the mosaics.
- Serve multiple views using the original imagery.
- Access the catalogs of imagery that make up the mosaic dataset.
- Exploit overlapping imagery, perform on-the-fly image processing, and explore temporal changes, using the advanced image-serving capabilities of this extension.
ArcGIS Imagery cache tiles
ArcGIS for Server provides direct access to preprocessed imagery cache. The image service cache is a preprocessed pyramid of imagery tiles configured at a range of scales. Image service caching improves the performance of image services in client applications. When accessing cached imagery using the enable cache view mode, the preprocessed cached tiles are sent to the client without further processing.
When you add an imagery cache with an image service, you end up with a dual-purpose image service that is accessed depending on its purpose. One purpose is to provide the fastest access to the image as a tiled service. The other purpose is to provide access through the mosaic dataset to the imagery repository, for queries, downloading, access to individual items, and to use in processing and analysis. Both options are available through a single image service starting with the ArcGIS 10.1 release.
Specific benefits of a cached image service include:
- Improved performance for basemap images.
- Skips overview generation.
- Improved performance for slow formats.
Best practice: If your image service is being used as a basemap image (like a map service to serve an image or as a background image), without expecting users to modify any of the properties of the image service, such as changing the mosaic methods, or performing a query, then caching is recommended for improved performance and scalability.
Imagery deployment workflow
Figure 5.23 shows the recommended imagery service configuration for optimum web delivery.
Recommended image caching workflow:
- Create mosaic dataset.
- Serve image services to key users (dynamic).
- Create map cache for larger web community.
- Maintain mosaic dataset.
- Update cache.
Image service cache:
- Provides static background image.
- Delivered as tiles for web caching.
- Most scalable web delivery.
- Created and served using mosaic dataset with ArcGIS for Server.
- Preprocessing and on-demand caching options.
Best practice: When you cache an image service, you end up with a dual-purpose image service that is accessed depending on its purpose. Caching is only required when you must create the fastest possible service containing image data. Generally, the pyramids generated for raster datasets or the overviews generated for mosaic datasets result in image data being served at an acceptable rate. However, if you know that a particular image or area of interest will be repeatedly visited, you may want to generate a cache.
CPT Platform Capacity Calculator custom imagery services
The CPT Platform Capacity Calculator can be used to demonstrate variation in image service performance due to the selected data source format.
The CPT Platform Capacity Calculator can be configured to show performance of the seven available imagery data source formats.
GIS enterprise data architecture
Figure 5.24 shows the GIS enterprise data architecture. GIS enterprise architecture often includes both feature and imagery data within the data center, requiring effective data management and automation to maintain the variety of data sources. The GIS data administrator must manage a hybrid architecture containing a mix of resources stored on file systems and multiple database platforms. ArcGIS technology provides a variety of processing and replication functions for maintaining data resources in an optimum configuration.
Your optimum data configuration will depend on your business needs.
Best practice: Data architecture solutions are unique to each business operation. Review your user requirements and operational needs to establish the optimum data architecture for your business.
Storage architecture options
Storage technology has evolved over the past 20 years to improve data access and provide better management of available storage resources. Understanding the advantages of each technical solution will help you select the storage architecture that best supports your needs.
Advent of the storage area network
Figure 5.25 shows the evolution of the storage area network. A storage area network provides an optimum data management solution for a data center with many database servers.
Local disk storage is provided for desktop workstations.
Best practice: Optimum workstation configurations today include two local disks for enhanced display performance and data protection.
Internal disk storage was provided in file servers following the same pattern used for desktop workstations.
- Many enterprise business solutions, including GIS, moved their data resources to database management systems in the 1990s.
- Database servers were initially purchased with larger internal storage bays to accommodate higher capacity data storage volumes.
Internal disk storage architecture started to cause a problem for larger data centers, as they found their disk storage assets would be siloed in dedicated server platforms, some with too much disk capacity and others with too little. A more adaptive storage management solution was needed.
Direct attached storage (DAS) moves the primary storage volumes into a separate platform tier with data volumes that can be assigned to servers as required to satisfy data storage needs.
- Server host bus adapters (HBA) and fiber channel communications maintained the same Small Computer System Interface (SCSI) communication protocol used for internal storage.
- Multiple fiber channel connections were provided to connect a single storage array with multiple database servers.
- Disk volumes were allocated as required to each database server and were not shared.
- Storage arrays were configured with redundant components to support operational high availability requirements.
As the data centers grew, there was a demand for more ways to allocate storage volumes to the growing number of database server platforms.
Storage area networks (SAN) establish a network between the multiple database servers and the multiple storage arrays providing adaptive connectivity for assigning storage volumes to the server platform tier as required to meet operational needs.
- The initial SAN switches provided fiber channel port connections compatible with existing HBA and storage array cabling.
- SAN establishes a fiber channel network for routing the storage traffic.
- Any storage volume on the storage array tier could be assigned to any database platform on the server tier.
Best practice: SANs have become an optimum storage solution for large data center environments with many database servers.
Advent of network-attached storage
Figure 5.26 shows the evolution of network attached storage. Network attached storage provides an optimum data management solution for high-capacity file-based storage (high-availability file share appliances). Direct-attached storage provided the first business case in favor of network attached storage appliances.
Direct-attached storage (DAS) had disk storage volumes dedicated to a specific assigned server. Two clustered file servers were required to create a high-availability file share. This DAS file share pattern was expensive and difficult to manage.
Network-attached storage (NAS) provided a high-availability file share appliance which would simply connect to a standard Internet Protocol (IP) local area network switch (file share protocols and full high availability redundancy were included with the NAS appliance). The NAS appliance provides a simple plug-and-play solution for including a file share on a local area network.
As NAS technology started to become popular (late 1990s), most enterprise business solutions, including GIS, were moving their data to database management systems. The SAN solution was winning over the NAS for most large data center deployments.
A new communication protocol, Internet Small Computer System Interface (iSCSI), was developed for building SAN solutions from NAS appliance systems. iSCSI storage area network:
- Supports transmission of the traditional SCSI block protocol over IP networks.
- Provides database vendors dedicated storage volumes.
- IP switch bandwidth has also increased beyond the capacity achieved by fiber channel switches, helping with performance concerns (10 Gbps IP switch technology).
- The NAS appliance architecture can provide an adaptive solution for both IP file sharing and iSCSI database storage management.
Best practice: NAS provides an optimum storage architecture for enterprise GIS operations. Most enterprise GIS data centers include a mix of SAN (fiber channel or iSCSI) and NAS solutions to satisfy their data management needs.
RAID (Redundant array of independent disks)
Enterprise GIS environments depend heavily on GIS data to support a variety of critical business processes. Data is one of the most valuable resources of a GIS, and protecting data is fundamental to supporting critical business operations.
The primary data protection line of defense is provided by the storage solutions. Most storage vendors have standardized on redundant array of independent disks (RAID) storage solutions for data protection. A brief overview of basic storage protection alternatives includes the following:
Just a Bunch of Disks (JBOD): A disk volume with no RAID protection is referred to as a just a bunch of disks (JBOD) configuration. This represents a configuration of disks with no protection and no performance optimization.
RAID 0: A disk volume in a RAID 0 configuration provides striping of data across several disks in the storage array. Striping supports parallel disk controller access to data across several disks reducing the time required to locate and transfer the requested data. Data is transferred to array cache once it is found on each disk. RAID 0 striping provides optimum data access performance with no data protection. One hundred percent of the disk volume is available for data storage.
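As a rough illustration of striping, the sketch below maps a logical block number to a disk and an offset on that disk using simple round-robin placement. It assumes a stripe unit of one block; real arrays use larger, configurable stripe units, and the function name is hypothetical.

```python
from typing import Tuple


def locate_block(logical_block: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk)."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    # Consecutive blocks land on consecutive disks, so a sequential read
    # is spread across all disks in the stripe set.
    return logical_block % num_disks, logical_block // num_disks


# Eight consecutive blocks striped across four disks.
layout = [locate_block(block, 4) for block in range(8)]
print(layout)
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```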
RAID 1: A disk volume in a RAID 1 configuration provides mirror copies of the data on disk pairs within the array. If one disk in a pair fails, data can be accessed from the remaining disk copy. The failed disk can be replaced and data restored automatically from the mirror copy without bringing the storage array down for maintenance. RAID 1 provides optimum data protection with minimum performance gain. Available data storage is limited to 50 percent of the total disk volume, since a mirror disk copy is maintained for every data disk in the array.
RAID 3 and 4: A disk volume in a RAID 3 or RAID 4 configuration supports striping of data across all disks in the array except for one parity disk. Parity information is calculated for each data stripe and stored on the parity disk. If one of the disks fails, the parity information can be used to recalculate and restore the missing data. RAID 3 and RAID 4 provide good protection of the data and allow optimum use of the storage volume. All disks except the one parity disk can be used for data storage, optimizing use of the available disk volume for data storage capacity.
There are some technical differences between RAID 3 and RAID 4, which, for our purposes, are beyond the scope of this discussion. Both of these storage configurations have potential performance disadvantages. The common parity disk must be accessed for each write, which can result in disk contention under heavy peak user loads. Performance may also suffer because of requirements to calculate and store the parity bit for each write. Write performance issues are normally resolved through array cache algorithms on most high-performance disk storage solutions.
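Parity-based recovery can be illustrated with exclusive OR (XOR), the operation commonly used for single-parity RAID. The sketch below is a toy model that treats each disk as a short byte string; the sample values and the `xor_blocks` helper are illustrative assumptions, not a storage vendor's implementation.

```python
from functools import reduce
from operator import xor
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    if any(len(block) != len(blocks[0]) for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe spread over three data disks, plus its parity block.
data_disks = [b"GIS!", b"data", b"tile"]
parity = xor_blocks(data_disks)

# Simulate losing the second disk, then rebuild it from the survivors.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])  # True
```

Because every write must also update the parity block, the same calculation explains the write overhead described above.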
The following RAID configurations are the most commonly used to support ArcSDE storage solutions. These solutions represent RAID combinations that best support data protection and performance goals.
Figure 5.27 shows the RAID 1/0 storage disk layout. RAID 1/0 is a composite solution including RAID 0 striping and RAID 1 mirroring.
- Optimum solution for high performance and data protection.
- Highest cost solution. Available data storage is limited to 50 percent of the total disk volume, since a mirror disk copy is maintained for every data disk in the array.
Best practice: High-activity database index tables and log files are best located on RAID 1/0 storage volumes.
Figure 5.28 shows the RAID 5 storage disk layout. RAID 5 supports striping of data and parity across all disks in the array, with the capacity equivalent of one disk used for parity.
- Parity is calculated for each data stripe, and the parity blocks are distributed across the disks rather than held on a single dedicated parity disk.
- If one disk fails, the parity can be used to recalculate and restore the missing data.
- Provides optimum disk utilization and near optimum performance, with usable storage equal to all but one disk's worth of capacity.
Best practice: GIS feature data and imagery files can be located on striped RAID 5 disks with minimum performance impact.
Figure 5.29 shows the RAID 6 storage disk layout. RAID 6 supports striping of data and parity across all disks in the array, with the capacity equivalent of two disks used for parity.
- If one or two disks fail, the parity can be used to recalculate and restore the missing data.
- Compared with RAID 5, RAID 6 provides improved protection and reduced data contention (larger disk volumes at the same protection levels).
- The primary driver for RAID 6 was the longer rebuild times required for larger-volume disks (higher risk of a two-disk failure).
Best practice: Use RAID 6 to reduce the risk of data loss from two concurrent disk failures.
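The storage capacity trade-offs described for each RAID level can be compared with a short calculation. This is a minimal sketch using the percentages stated above (100 percent for striping, 50 percent for mirroring, one disk's worth of capacity lost to parity for RAID 3, 4, and 5, and two for RAID 6); the function name and level labels are hypothetical.

```python
def usable_fraction(level: str, num_disks: int) -> float:
    """Return the fraction of raw disk capacity available for data."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    if level in ("JBOD", "RAID 0"):
        return 1.0
    if level in ("RAID 1", "RAID 1/0"):
        if num_disks < 2:
            raise ValueError("mirroring needs at least two disks")
        return 0.5
    if level in ("RAID 3", "RAID 4", "RAID 5"):
        parity_disks = 1
    elif level == "RAID 6":
        parity_disks = 2
    else:
        raise ValueError(f"unknown RAID level: {level}")
    if num_disks <= parity_disks:
        raise ValueError("not enough disks for this RAID level")
    return (num_disks - parity_disks) / num_disks


for level in ("RAID 0", "RAID 1/0", "RAID 5", "RAID 6"):
    print(level, usable_fraction(level, 8))
# RAID 0 1.0
# RAID 1/0 0.5
# RAID 5 0.875
# RAID 6 0.75
```

With eight disks, RAID 6 gives up one more disk of capacity than RAID 5 in exchange for surviving a second concurrent failure.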
Will storage be the next bottleneck?
Technology is improving display performance and moving more data faster and more efficiently throughout the server, network, and storage infrastructure.
- Larger disk volumes (reduced number of disks in array)
- Increasing storage traffic loads (cached tiles, imagery files)
- Faster display processing
- Higher capacity servers
- Larger peak concurrent GIS user loads
Warning: All technology advances point toward increased potential for disk contention.
The good news is that there are technical solutions available to resolve contention. It is also quite simple to monitor disk I/O performance and identify if disk contention is a problem.
- Cache data files on edge servers or web accelerator appliances (servers located near the site communication access points).
- Disk contention can also be reduced with RAID data striping; distributing the data across larger RAID volumes can reduce the probability of disk contention.
- Solid-state disk drives are available in the current marketplace, with solutions that deliver data over 1000 times faster than current mechanical disk drives.
Note: The cost of solid-state drives is currently much higher than that of their mechanical counterparts. This can change as solid-state market sales volume increases (vendors are waiting for an opportunity to upgrade storage solutions).
- Be aware of the potential for performance bottlenecks within your storage environment.
- Monitor disk I/O performance to identify when disk contention is causing a performance delay.
- Be aware that there are solutions to disk I/O performance problems, and take appropriate action to address performance issues when they occur.
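One way to act on the monitoring advice above is to flag sustained high disk utilization. The sketch below assumes you already collect samples of disk busy time per measurement interval from an operating system performance counter; the 80 percent threshold and three-sample window are illustrative assumptions, not recommended values from this document.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (busy milliseconds, interval milliseconds)


def disk_utilization(samples: List[Sample]) -> List[float]:
    """Convert (busy_ms, interval_ms) samples into utilization fractions."""
    return [busy / interval for busy, interval in samples]


def contention_suspected(samples: List[Sample],
                         threshold: float = 0.8,
                         min_consecutive: int = 3) -> bool:
    """Return True when utilization stays at or above the threshold
    for min_consecutive samples in a row."""
    streak = 0
    for utilization in disk_utilization(samples):
        streak = streak + 1 if utilization >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False


busy_disk = [(200, 1000), (850, 1000), (900, 1000), (950, 1000)]
quiet_disk = [(200, 1000), (900, 1000), (300, 1000), (900, 1000)]
print(contention_suspected(busy_disk))   # True
print(contention_suspected(quiet_disk))  # False
```

Requiring several consecutive samples avoids raising an alarm on a single short burst of disk activity.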
Ways to move GIS data
GIS data can be moved using an extract and load process or by forms of replication. The optimum method for moving your data will be determined by the amount of data to be moved and the associated business processes.
Traditional tape backup/disk copy
Figure 5.30 shows ways to move large datasets. Large volumes of data are best moved on disk, DVD, or tape back-up.
Moving large volumes of data across shared network resources is very expensive, and impacts performance for all users on that network segment. Network bandwidth is a valuable and limited resource. If you conserve network bandwidth for primary display and query tasks, you will have more capacity available for better display performance.
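A quick estimate shows why large transfers tie up a shared network. The sketch below computes transfer time from data volume and link bandwidth; the `efficiency` parameter, which stands in for protocol overhead and competing traffic, and its default value are assumptions for illustration.

```python
def transfer_hours(size_gb: float, link_mbps: float,
                   efficiency: float = 0.7) -> float:
    """Estimate hours needed to move size_gb gigabytes over a link.

    Uses decimal units: 1 GB = 8,000 megabits.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    megabits = size_gb * 8 * 1000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# 1 TB (1,000 GB) over a 100 Mbps link.
print(round(transfer_hours(1000, 100, efficiency=1.0), 1))  # 22.2
print(round(transfer_hours(1000, 100), 1))                  # 31.7
```

Even with the full link dedicated to the copy, a terabyte takes most of a day at 100 Mbps, during which other users on that network segment compete for the same bandwidth.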
Figure 5.31 shows a database replication architecture. Commercial database replication works well for maintaining a back-up failover database.
Best practice: Use database replication when you want to create and maintain a complete copy of the database environment. Database vendors provide optimum tools for data protection, back-up, and recovery operations.
Warning: ArcSDE geodatabase replication should not be used for back-up and recovery operations.
Figure 5.32 shows a storage replication architecture. Storage disk-level replication provides the optimum solution for data center back-up and recovery operations.
Storage vendors typically provide incremental snapshot back-ups to local and remote data center locations. Many of these solutions include proven recovery tools and are widely used. Back-up from storage volumes avoids database complexity issues because data is replicated at the disk storage block level.
Best practice: Use storage disk replication when you want to create and maintain a complete copy of the data storage environment. Storage vendors provide optimum tools for data protection, back-up, and recovery operations.
Protect your GIS data resources
Data protection at the disk level minimizes the need for system recovery in the event of a single disk failure but will not protect against a variety of other data failure scenarios. It is always important to keep a current backup copy of critical data resources, and maintain a recent copy at a safe location away from the primary site. Figure 5.33 highlights data backup strategies available to protect your business operations. It is important to maintain a snapshot back-up or copy of your data. A large percentage of data loss is caused by human error. The best way to protect your data is to maintain a reliable periodic point-in-time back-up strategy.
The type of backup system you choose for your business will depend on your business needs. For simple, low-priority, single-use environments, you can create a periodic point-in-time backup on a local disk or tape drive and maintain a recent off-site copy of your data for business recovery. For larger enterprise operations, system availability requirements may drive requirements for failover to backup data centers when the primary site fails. Your business needs will drive the level of protection you need.
Data backups provide the last line of defense for protecting data investments. Careful planning and attention to storage backup procedures are important factors in a successful backup strategy. Data loss can result from many types of situations, with some of the most probable being administrative or user error.
Host Tape Backup: Traditional server backup solutions use lower-cost tape storage for backup. Data must be converted to a tape storage format and stored on a linear tape medium. Backups can be a long, drawn-out process that takes considerable server processing resources (typically consuming a CPU during the backup process) and requires special data management for operational environments.
For database environments, point-in-time backups are required to maintain database continuity. Database software provides for online backup requirements by enabling a procedural snapshot of the database. A copy of the protected snapshot data is retained in a snapshot table when changes are made to the database, supporting point-in-time backup of the database and potential database recovery back to the time of the snapshot.
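The snapshot mechanism described above resembles a copy-on-write scheme: the original value is saved only when a record is first changed after the snapshot begins. The following toy sketch illustrates that idea with an in-memory key/value store; the `SnapshotStore` class is hypothetical and does not represent how any particular DBMS implements snapshots.

```python
from typing import Dict


class SnapshotStore:
    """Toy key/value store with a single copy-on-write snapshot."""

    _MISSING = object()  # marks keys that did not exist at snapshot time

    def __init__(self) -> None:
        self.live: Dict[str, str] = {}
        self.snapshot_table = None  # None means no snapshot is active

    def begin_snapshot(self) -> None:
        """Start a point-in-time snapshot; nothing is copied yet."""
        self.snapshot_table = {}

    def write(self, key: str, value: str) -> None:
        """Write to the live data, preserving the pre-change value once."""
        if self.snapshot_table is not None and key not in self.snapshot_table:
            self.snapshot_table[key] = self.live.get(key, self._MISSING)
        self.live[key] = value

    def read_snapshot(self) -> Dict[str, str]:
        """Rebuild the data as it looked when the snapshot began."""
        if self.snapshot_table is None:
            raise RuntimeError("no snapshot is active")
        view = dict(self.live)
        for key, old_value in self.snapshot_table.items():
            if old_value is self._MISSING:
                view.pop(key, None)
            else:
                view[key] = old_value
        return view


store = SnapshotStore()
store.write("parcel_1", "v1")
store.write("parcel_2", "v1")
store.begin_snapshot()
store.write("parcel_1", "v2")   # changed after the snapshot
store.write("parcel_3", "v1")   # added after the snapshot
print(store.read_snapshot())    # {'parcel_1': 'v1', 'parcel_2': 'v1'}
print(store.live["parcel_1"])   # v2
```

A backup process can copy the snapshot view while users continue to edit the live data.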
Host processors can be used to support backup operations during off-peak hours. If backups are required during peak-use periods, backups can impact server performance.
Network Client Tape Backup: The traditional online backup can often be supported over the LAN with the primary batch backup process running on a separate client platform. DBMS snapshots may still be used to support point-in-time backups for online database environments. Client backup processes can contribute to potential network performance bottlenecks between the server and the client machine because of the high data transfer rates during the backup process.
Storage Area Network Client Tape Backup: Some backup solutions support direct disk storage access without impacting the host DBMS server environment. Storage backup is performed over the SAN or through a separate storage network access to the disk array with batch process running on a separate client platform. A disk-level storage array snapshot is used to support point-in-time backups for online database environments. Host platform processing loads and LAN performance bottlenecks can be avoided with disk-level backup solutions.
Disk Copy Backup: The size of databases has increased dramatically in recent years, growing from tens of gigabytes to hundreds of gigabytes and, in many cases, terabytes of data. Recovery of large databases from tape backups is very slow, taking days to recover large spatial database environments. At the same time, the cost of disk storage has decreased dramatically providing disk copy solutions for large database environments competitive in price to tape storage solutions. A copy of the database on local disk, or a copy of these disks to a remote recovery site, can support immediate restart of the DBMS following a storage failure by simply restarting the DBMS with the backup disk copy.
There are several names for disk backup strategies (remote backup, disaster recovery plan, business continuance plan, continuation of business operations plan, etc.). The important thing is that you consider your business needs, evaluate risks associated with loss of data resources, and establish a formal plan for business recovery in the event of data loss.
The question is not if, but when. Most people will, at some time, experience loss of valuable data and business resources.
Data Management Overview
Support for distributed database solutions has traditionally introduced high-risk operations, with potential for data corruption and use of stale data sources in GIS operations. There are organizations that support successful distributed solutions. Their success is based on careful planning and detailed attention to their administrative processes that support the distributed data sites. More successful GIS implementations support central consolidated database environments with effective remote user performance and support. Future distributed database management solutions may significantly reduce the risk of supporting distributed environments. Whether centralized or distributed, the success of enterprise GIS solutions will depend heavily on the administrative team that keeps the system operational and provides an architecture solution that supports user access needs.
The next chapter will discuss Network Communications, providing some insight on how to build GIS solutions that support remote user performance and Enterprise GIS scalability.
Previous Editions
GIS Data Administration 32nd Edition
GIS Data Administration 31st Edition
GIS Data Administration 30th Edition
GIS Data Administration 29th Edition
GIS Data Administration 28th Edition
GIS Data Administration 27th Edition