GIS Data Administration

From Wiki.GIS.com



Spring 2015 GIS Data Administration 36th Edition

Data provides the resources you need to make proper business decisions. The information products required to make business decisions determine the critical data resources that must be available for business operations. How you organize and maintain your data resources will contribute to system performance and user productivity.

How GIS data is managed has changed dramatically over the past 10 years. Much of this change is driven by technology. The big focus in the 1990s was to move GIS data resources together in an SDE geodatabase, where users could better manage and share enterprise data resources. Data management today includes multiple publication formats to improve display performance and capture change over time.

A variety of data management and distribution strategies are available today to improve data access and dissemination throughout the rapidly expanding GIS user community. The volume of data you must sort through each day is growing exponentially. How you manage, organize, and control these data resources is critical to your success.



GIS feature data architecture

Figure 5.1 GIS feature data architecture includes ArcSDE geodatabase (production source), geodatabase (GDB) archive, distribution database (publication data source), and vector map cache.

Figure 5.1 shows the data source architecture patterns available to manage GIS feature data. GIS feature data includes points, polygons, lines, complex features, and associated attributes. Additional content may include parcel fabric, cartographic representations, lidar point elevation, terrain data, etc.

Geospatial data is the core integration of business intelligence. How you organize your data contributes directly to your business complexity and drives the performance of your business operations. Good data management empowers your ability to make proper business decisions.

There are only two kinds of data: useful data and useless data. Useful data is what you use to create business information products and enable informed business decisions. Useless data is what you do not use, and it can rapidly increase the complexity of your data repository.

Best practice: Data should be organized and managed to empower proper business decisions and optimize user productivity.

GIS Feature Data Production Database

A production data source is an ArcSDE geodatabase used to organize and manage your geospatial feature data resources.

Best practice: A production data source provides a single integrated repository for all enterprise-level geospatial feature data resources.

GIS Feature Data Publication Database

A publication geodatabase is an ArcSDE or file geodatabase used to optimize distribution of finalized geospatial data resources.

Best practice: Distribute geospatial business operational layers in a publication geodatabase.

GIS Feature Data Map Cache

A feature data map cache is a collection of preprocessed tiled map images stored at multiple scales for rapid dissemination.

Best practice: Distribute geospatial static basemap layers in a preprocessed map cache.

GIS Feature Data Archiving

Geodatabase (GDB) archiving (shown as an available component of the production data source) includes functionality to record and access changes made to all or a subset of data in a versioned geodatabase.
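
The idea of recording changes and then accessing the data as it existed at an earlier moment can be illustrated with a minimal sketch. This is not the ArcSDE archive implementation; the row layout, the field names (`feature_id`, `value`, `valid_from`, `valid_to`), and the `OPEN_END` sentinel are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical sentinel meaning "this row state is still current".
OPEN_END = datetime.max


def rows_as_of(archive, moment):
    """Return {feature_id: value} for the row states valid at `moment`."""
    return {
        row["feature_id"]: row["value"]
        for row in archive
        if row["valid_from"] <= moment < row["valid_to"]
    }


# Each edit closes the previous row state and opens a new one.
archive = [
    {"feature_id": 1, "value": "vacant lot",
     "valid_from": datetime(2014, 1, 1), "valid_to": datetime(2015, 3, 1)},
    {"feature_id": 1, "value": "house",
     "valid_from": datetime(2015, 3, 1), "valid_to": OPEN_END},
    {"feature_id": 2, "value": "road",
     "valid_from": datetime(2014, 6, 1), "valid_to": OPEN_END},
]

early_view = rows_as_of(archive, datetime(2014, 3, 1))
later_view = rows_as_of(archive, datetime(2015, 6, 1))
print(early_view)
print(later_view)
```

Keeping every superseded row state, rather than overwriting it, is what allows a historical view to be rebuilt for any moment covered by the archive.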


CPT Platform Capacity Calculator Custom Web Mapping Services

The CPT Platform Capacity Calculator is a simple tool for evaluating selected platform capacity. The default tool, located at the bottom of the CPT Hardware tab, includes a variety of standard workflows that demonstrate platform capacity. For analysis and reporting purposes, you may want to change the default list of sample workflows and include those workflows you are evaluating in your own design environment. This link describes how you can change the Platform Capacity Calculator workflow samples to a custom set of workflows for demonstration purposes.

ArcSDE Geodatabase

The release of ArcGIS technology introduced the ArcSDE geodatabase, which provides a way to manage long transaction edit sessions within a single database instance. ArcSDE supports long transactions using versions (different views) of the database. A geodatabase can support thousands of concurrent versions of the data within a single database instance. The default version represents the real world, and other named versions are proposed changes and database updates in work.


What is versioning?

Figure 5.2 A versioned geodatabase manages multiple edit sessions over time without duplicating data.

Geodatabase versioning allows multiple users to edit the same data in an ArcSDE geodatabase without applying locks or duplicating data. Figure 5.2 provides a drawing of a versioned geodatabase workflow.

Users always access an ArcSDE geodatabase through a version. When you connect to a multiuser geodatabase, you specify the version to which you will connect. By default, you connect to the DEFAULT version.

Best practice: Use a versioned geodatabase when managing multiple edit sessions of common feature datasets over time.


Geodatabase versioning example

Figure 5.3 Typical long transaction workflow lifecycle includes an initial edit session to develop a prototype design, a relatively long construction phase, and a final as-built design update before final posting.

GIS users have many use-cases in which long transaction workflows are critical. Figure 5.3 shows a long transaction workflow for developing a new community housing subdivision.

A new housing subdivision is being approved by the city.


ArcSDE explicit state model

Figure 5.4 Versioned geodatabase includes a DEFAULT lineage and multiple open version lineages saved at various states over time.

Figure 5.4 shows the progress of a versioned workflow edit session over time.

The diagram shows DEFAULT version lineage and new version lineages.


ArcSDE version state tuning

Figure 5.5 ArcSDE version state tuning functions provide administrators with tools to improve database performance.

Figure 5.5 shows the ArcSDE Geodatabase DEFAULT version state tree. Enterprise ArcSDE production database maintenance environments often support many GIS editors posting many changes to the geodatabase DEFAULT lineage over time. In many scenarios, the DEFAULT lineage tree can rapidly grow to hundreds and even thousands of state changes. Many of the states are redundant and can cause database tables to grow in size. Reducing the size of these tables can improve database performance and reduce maintenance overhead.

Compressing the state tree removes each state that is not currently referenced by a version and is not the parent of multiple child states. When executed by the ArcSDE administrator, all states that meet the compression criteria are removed, regardless of owner. All other users can compress only the states that they own. This operation reduces the depth of the state tree, shortening the lineage and improving query performance.

Best practice: The DEFAULT state tree should be compressed on a periodic schedule to maintain optimum query performance.
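
The compression rule described above can be sketched on a toy state tree. This is a simplified model, not the ArcSDE compress operation: the `parents` and `versions` dictionaries and the function names are hypothetical, the rule is applied in a single pass, and the merging of edits from removed states into their successors is not modeled.

```python
def compress_state_tree(parents, versions, root=0):
    """Remove states that no version references and that have at most one child.

    parents:  dict mapping state id -> parent state id (the root maps to None)
    versions: dict mapping version name -> state id it references
    Returns a new parents dict containing only the surviving states.
    """
    referenced = set(versions.values())
    child_count = {}
    for state, parent in parents.items():
        if parent is not None:
            child_count[parent] = child_count.get(parent, 0) + 1

    removable = {
        state for state in parents
        if state != root
        and state not in referenced
        and child_count.get(state, 0) <= 1
    }

    def surviving_ancestor(state):
        # Walk up the tree until a state that is kept (or None) is reached.
        parent = parents[state]
        while parent in removable:
            parent = parents[parent]
        return parent

    return {s: surviving_ancestor(s) for s in parents if s not in removable}


def lineage_depth(parents, state):
    """Count the states between `state` and the root of the tree."""
    depth = 0
    while parents[state] is not None:
        state = parents[state]
        depth += 1
    return depth


# DEFAULT lineage 0-1-2-3-4, with an edit version branching off state 2.
parents = {0: None, 1: 0, 2: 1, 3: 2, 4: 3, 5: 2, 6: 5}
versions = {"DEFAULT": 4, "t1": 6}

compressed = compress_state_tree(parents, versions)
print(compressed)
print(lineage_depth(parents, 4), lineage_depth(compressed, 4))
```

In this example state 2 survives because it is the parent of two child states, while the unreferenced intermediate states 1, 3, and 5 are removed, which shortens the DEFAULT lineage.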

You can also trim the state tree, which collapses a linear branch of the tree. This is done by reconciling and saving an edit lineage to a more current state (for example, version t1 could be reconciled and saved to reference state 7, freeing state 1 for compression). Trimming database states will reduce the depth of the state tree, shortening the lineage and improving query performance.

Best practice: The DEFAULT state tree may need to be trimmed to reduce the number of long transaction reference states for optimum query performance.


Versioned geodatabase view

Figure 5.6 Several tables are added to the geodatabase schema when establishing a versioned geodatabase. Two key tables include the Adds table and the Deletes table.

Figure 5.6 shows the key database tables supporting a versioned geodatabase view. Several additional tables are included in a versioned geodatabase. ArcSDE uses the additional tables to manage multiple concurrent edit sessions and query access to different views of the data.

For example, a user query of the current DEFAULT view would include features (rows) in the Base table, plus any posted rows from the Adds table, minus any posted rows from the Deletes table. This same approach would be used to view open Edit versions, with ArcSDE sorting out what features were required to populate the version view.
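
The "Base plus Adds minus Deletes" rule can be written out as a minimal sketch. The dictionaries below are hypothetical stand-ins for the three tables; the real ArcSDE tables also carry state identifiers that determine which Adds and Deletes rows belong to a given version, and that filtering is assumed to have already happened here.

```python
def version_view(base, adds, deletes):
    """Return the rows visible in a version.

    base:    dict of row id -> attributes in the Base table
    adds:    dict of row id -> attributes posted to the Adds table
    deletes: set of row ids posted to the Deletes table
    """
    rows = dict(base)        # start from the Base table
    rows.update(adds)        # plus rows from the Adds table
    for row_id in deletes:   # minus rows from the Deletes table
        rows.pop(row_id, None)
    return rows


base = {1: "parcel A", 2: "parcel B", 3: "parcel C"}
adds = {4: "parcel D"}
deletes = {2}

default_view = version_view(base, adds, deletes)
print(default_view)
```

Because the Base table itself is never modified by the view, the same function can serve a different version simply by passing that version's Adds and Deletes rows.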


Versioning managed by ArcSDE schema

Figure 5.7 ArcSDE geodatabase includes the ArcSDE schema and the user schema.

Figure 5.7 shows the ArcSDE geodatabase schema. The ArcSDE geodatabase includes the ArcSDE schema and the user schema. The ArcSDE geodatabase license key must be installed with the ArcSDE schema.

The ArcSDE schema is used to manage operations within the versioned database environment.

Note: Multiple user schema instances were supported with the ArcGIS 9.2 release.


Geodatabase replication use-cases

ArcSDE manages the versioning schema of the geodatabase and supports client application access to the appropriate views of the geodatabase. ArcSDE also supports export and import of data from and to the appropriate database tables and maintains the geodatabase schema defining relationships and dependencies between the various tables.

Figure 5.9 There are four basic types of geodatabase replication patterns that enable a broad variety of distributed enterprise and federated architecture scenarios for GIS deployment.

ArcSDE provides a variety of data replication options associated with a versioned geodatabase as shown in Figure 5.9.

Note: Replication is the process of sharing data so as to ensure consistency between redundant sources. Geodatabase replication provides filtered data sharing at the version level.

Mobile operations:

Production/publication:

Distributed operations:

Hierarchical operations:


Distributed enterprise architecture strategies

Figure 5.10 Enterprise GIS operations often involve a mix of geodatabase replication patterns including mobile, publication, and distributed operations.

Enterprise GIS operations often include a variety of geodatabase replication functions as shown in Figure 5.10.

The four red arrows on the chart show the use of geodatabase replication services.

Mobile operations

Mobile field operations are a big part of GIS workflows within most organizations.

Best practice: Mobile operations provide a way to integrate field operators into the computerized business workflow processes, improving business efficiency and reducing effort required to get data into the computerized systems.


ArcSDE geodatabase replication support for disconnected editing operations.

Figure 5.11 Mobile desktop operations support ArcGIS client remote field editing.

Figure 5.11 shows ArcSDE geodatabase replication support for disconnected desktop SDE geodatabase client editing operations.

Note: Disconnected editing extends the geodatabase to provide clients with the capability to perform edit operations in the field when not connected to the central geodatabase.

Check-out operations were initially supported with the ArcGIS 8.3 release.


ArcSDE geodatabase replication support for disconnected Workgroup SDE geodatabase editing operations.

Figure 5.12 Mobile workgroup operations support remote SDE geodatabase field operations with multiple editors.

Figure 5.12 shows ArcSDE geodatabase replication support for disconnected Workgroup SDE geodatabase server editing operations.

Check-out operations were initially supported with the ArcGIS 8.3 release.

Distributed Geodatabase Operations can be used to support disconnected SDE geodatabase clients with incremental synchronization capabilities.

Production/publication operations

A versioned production geodatabase performs many functions related to the editing workflows that place processing demands on the server.

For many GIS operations, hundreds of users throughout the organization require access to the production data source, most requiring access to the published DEFAULT version.

Best practice: Separating the publishing database from the production database provides a more scalable and secure data management environment.

Several reasons an organization may want to use a separate publication geodatabase:


Figure 5.13 One-way replication from a production geodatabase to a publication geodatabase.

Figure 5.13 shows how ArcSDE replication is used to share a version of the ArcSDE geodatabase on a separate publication geodatabase.

Best practice: ArcSDE replication is the best solution when moving part of a geodatabase to a separate geodatabase instance.
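
One-way replication with incremental synchronization can be pictured as replaying only the changes made since the last synchronization. The sketch below is an illustration under stated assumptions, not the ArcSDE replication mechanism: the change log format `(sequence, operation, row_id, value)`, the function name, and the sequence marker are all hypothetical, and the log is assumed to be ordered by sequence number.

```python
def synchronize_one_way(parent_log, child_rows, last_synced):
    """Apply parent changes newer than `last_synced` to the child replica.

    parent_log:  list of (sequence, operation, row_id, value), ordered by sequence
    child_rows:  dict of row id -> value, updated in place
    last_synced: sequence number of the last change already applied
    Returns the new synchronization marker.
    """
    for sequence, operation, row_id, value in parent_log:
        if sequence <= last_synced:
            continue  # already present in the child replica
        if operation == "delete":
            child_rows.pop(row_id, None)
        else:  # "insert" or "update"
            child_rows[row_id] = value
        last_synced = sequence
    return last_synced


production_log = [
    (1, "insert", 10, "hydrant"),
    (2, "insert", 11, "valve"),
]
publication = {}

# First synchronization copies everything recorded so far.
marker = synchronize_one_way(production_log, publication, 0)

# Later edits on the production side are sent as increments only.
production_log += [
    (3, "update", 10, "hydrant (repaired)"),
    (4, "delete", 11, None),
]
marker = synchronize_one_way(production_log, publication, marker)
print(marker, publication)
```

Changes flow in one direction only in this sketch; nothing edited in `publication` would ever be sent back to the production side.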

There are several advantages to using a separate publication geodatabase instance for sharing data to GIS viewers.

Organizations use one-way geodatabase replication for the following reasons:

Best practice: An SDE geodatabase should be used when live updates are required during peak viewing operations.
Best practice: A separate file geodatabase instance should be provided for local access by each GIS server for optimum query performance.

Separating the publication database from the maintenance database improves data security.


Extract/transform/load operations

Figure 5.14 Geodatabase transformation between geodatabases with different schema is provided by the data interoperability extension.

Figure 5.14 shows how ArcSDE geodatabase replication can be used to move data to a different geodatabase schema. Geodatabase transformation is a term used to represent replication to a geodatabase with a different schema.

Best practice: Geodatabase transform is the best solution for replicating part of a geodatabase to a separate geodatabase instance with a different schema.

The ArcGIS for Desktop Data Interoperability extension is used to create a script to transform data between the two schemas.

Best practice: A service can be deployed on ArcGIS for Server for incremental one-way replication.



Distributed geodatabase operations

Figure 5.15 Distributed geodatabase replication enables functionality for a single SDE geodatabase schema synchronized across multiple SDE geodatabase platforms.

Figure 5.15 shows a distributed geodatabase configuration. ArcGIS Geodata services can be used to establish distributed geodatabase operations for scale-out database architectures.

Distributed regional office support:

Warning: A system back-up should be completed after each synchronization. It is important that all child replicas remain synchronized with their parent versions located on the parent corporate geodatabase.

Distributed remote ArcGIS editor (disconnected geodatabase editing operations):

Best practice: Distributed geodatabase replication is convenient for single-user mobile editing operations.

Geodatabase replication can be used to build and support central data center high-capacity scale-out geodatabase operations:

Best practice: System back-up is provided after each synchronization from a common data center storage repository, to maintain consistency between parent and child database volumes.

Distributed geodatabase architecture provides a highly scalable computing infrastructure without placing high demands on any single database or server component.


Hierarchical operations

GIS operations are expanding to include national and global architecture solutions.

Figure 5.16 ArcSDE geodatabase replication can be used to manage multiple tiers of data layers throughout a global environment.

Figure 5.16 shows a common pattern in managing hierarchical data sharing solutions. Different versions of the data must be shared at different levels within federated organizations. Geodatabase replication enables federated geodatabase operations to automate and manage these environments.

GIS operations often extend well beyond the boundaries of a single organization.

Heavy arrows show supporting geodatabase replication services.

Best practice: Multi-tier federated architecture patterns are being implemented by a variety of national federal and military agencies. These architecture patterns are also supporting a variety of international corporations.


ArcGIS for Desktop direct connection to supported DBMS content

Figure 5.16.1 ArcGIS for Desktop direct supported DBMS connections for edit, view, query, and analysis operations.

ArcGIS for Desktop software can connect to supported database (DBMS) content for edit, view, query, and analysis operations. Figure 5.16.1 shows the available database connection architecture patterns.

ArcGIS for Desktop software provides direct connections to supported database servers enabling view, query and analysis of the DBMS data content. Some of the databases you access can contain geodatabase tables, functions, and procedures, but they don't have to; you can connect to any supported database and view the data from ArcGIS for Desktop.

ArcSDE geodatabases, also known as multiuser geodatabases, are stored in a relational database using Oracle, Microsoft SQL Server, IBM DB2, IBM Informix, or PostgreSQL. These geodatabases require the use of ArcSDE and can be unlimited in size and numbers of users.

ArcMap allows you to edit a supported database by creating a local copy of data from a published ArcGIS for Server feature service. You can then make edits to the local copy in ArcMap and synchronize the edits back to the service. Edits can be made to the local copy without having to be connected to the server. Access to the server is only required when creating the local copy or applying changes from the local copy to the server. This workflow can be useful when your organization has disconnected employees and provides a common method for editing the same data using multiple clients, such as through the web or using desktop applications. The functionality is built into ArcMap and does not require any customizations. Edits to a published feature service are captured in a single version of the database.

GIS imagery data architecture

Figure 5.17 GIS imagery data architecture includes ArcGIS Online, pre-processed imagery files, raw imagery files managed by the mosaic dataset, image service cache, and historical imagery archives (history).

Figure 5.17 shows the data source architecture patterns available to manage imagery data. Imagery data sources include aerial photography, elevation data, and satellite imagery.

Imagery is becoming the most valuable real-time business intelligence. Aerial photographs and satellite images can be collected in real time during a regional emergency and used to evaluate and respond to national disasters in a timely way. The volume of imagery is growing exponentially as technology for data collection and storage is rapidly evolving to leverage these digital information products.

Best practice: Rapid imagery collection and publication timelines are key to proper coordination and response to natural and man-made national disasters.

Imagery can provide valuable information when information products are examined over time. Views of the global ice caps and effects on ground cover can show information for managing climate change. Community development and agriculture changes can benefit from national imagery datasets showing changes in communities and farm products on a global scale.

An imagery file share is used to store the raw imagery files.

Imagery is a primary resource for visualizing live global assets.


What is a mosaic dataset?

Figure 5.18 A mosaic dataset is a set of tools and a metadata catalog for managing a GIS imagery repository.

Figure 5.18 shows a mosaic dataset. The mosaic dataset was developed to manage and deploy imagery information products in a time-sensitive workflow environment.

[A mosaic dataset] consists of:


ArcGIS image access patterns

Figure 5.19-22 Imagery can be accessed from ArcGIS for Desktop, ArcGIS for Server image services (preprocessed and on-the-fly processing), and ArcGIS Image Cache.

Figure 5.19-22 shows a variety of available ArcGIS imagery deployment patterns. Imagery deployment patterns include ArcGIS for Desktop direct access to imagery, ArcGIS for Server image service access to single preprocessed imagery raster datasets, ArcGIS for Server with Image Extension license access to multiple imagery files with on-the-fly processing, and direct access to preprocessed imagery cache tiles.

ArcGIS for Desktop direct access to imagery

ArcGIS for Desktop provides direct access to imagery resources. ArcGIS for Desktop is used to create a mosaic dataset and perform imagery analysis. A mosaic dataset can be used as a common search engine for organizing and accessing local imagery resources. Creating, editing, or working with imagery is a core ArcGIS for Desktop capability and does not require an ArcGIS Image extension.

ArcGIS for Server image service access to imagery

ArcGIS for Server capabilities include Image Service access to pre-processed imagery. Imagery must be preprocessed and mosaicked as a raster dataset for direct access by ArcGIS for Server Image Service (without the Image Extension). An imagery raster dataset (providing single image access) can be stored in a file system or loaded into a geodatabase.

Warning: Imagery raster datasets tend to be quite large (can occupy several terabytes). Copying large volumes of raster data into a geodatabase challenges even the most experienced database administrators.

ArcGIS for Server Image Extension access to imagery

The ArcGIS for Server Image Extension enables Image Service access to multiple imagery files. The ArcGIS Image Extension is a license added to ArcGIS for Server, which extends the capability of serving raster data. Specifically, it allows you to use the mosaic dataset and on-the-fly imagery processing through the ArcGIS for Server Image service.

Best practice: With the ArcGIS Image extension, you can serve an imagery repository through a mosaic dataset or raster dataset layers.

The ArcGIS for Server Image Extension gives you the ability to:

ArcGIS Imagery cache tiles

ArcGIS for Server provides direct access to preprocessed imagery cache. The image service cache is a preprocessed pyramid of imagery tiles configured at a range of scales. [Image service caching] improves the performance of image services in client applications. When accessing cached imagery using the enable cache view mode, the preprocessed cached tiles are sent to the client without further processing.

When you add an imagery cache with an image service, you end up with a dual-purpose image service that is accessed depending on its purpose. One purpose is to provide the fastest access to the image as a tiled service. The other purpose is to provide access through the mosaic dataset to the imagery repository, for queries, downloading, access to individual items, and to use in processing and analysis. Both options are available through a single image service starting with the ArcGIS 10.1 release.

Specific benefits of a cached image service include:

Best practice: If your image service is being used as a basemap image (like a map service to serve an image or as a background image), without expecting users to modify any of the properties of the image service, such as changing the mosaic methods, or performing a query, then caching is recommended for improved performance and scalability.
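
To get a feel for how quickly a tile pyramid grows, the following sketch counts tiles per level. It assumes a quadtree-style tiling scheme in which level 0 is a single tile and each finer level doubles the resolution in both directions; that scheme and the function names are assumptions for illustration, and an actual cache covers only the extent and scale levels you configure.

```python
def tiles_at_level(level):
    """Number of tiles covering the full extent at one pyramid level."""
    return 4 ** level


def total_tiles(max_level):
    """Number of tiles in the whole pyramid from level 0 through max_level."""
    return sum(tiles_at_level(level) for level in range(max_level + 1))


for level in (0, 3, 10):
    print(level, tiles_at_level(level), total_tiles(level))
```

Under this assumption each added level roughly quadruples the tile count, which is one reason caching effort is usually concentrated on layers and areas that are viewed repeatedly.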


Imagery deployment workflow

Figure 5.23 The image service cache is preprocessed for rapid display performance. Image resources are available with the same image service.

Figure 5.23 shows the optimum image service configuration for web delivery.

Recommended image caching workflow:

Image service cache:

Best practice: When you cache an image service, you end up with a dual-purpose image service that is accessed depending on its purpose. Caching is only required when you must create the fastest possible service containing image data. Generally, the pyramids generated for raster datasets or the overviews generated for mosaic datasets result in image data being served at an acceptable rate. However, if you know that a particular image or area of interest will be repeatedly visited, you may want to generate a cache.


CPT Platform Capacity Calculator custom imagery services

The CPT Platform Capacity Calculator can be used to demonstrate variation in Image service performance due to the selected data source format.

Select a custom imagery workflow configuration on the CPT Platform Capacity Calculator tab

CPT Platform Capacity Calculator can be configured to show performance of the seven (7) available Imagery data source formats.

Selecting an imagery workflow on the CPT Calculator tab
Selecting the imagery workflow on the CPT Design tab

GIS enterprise data architecture

Figure 5.24 GIS enterprise data architecture is no longer just a geodatabase. It often includes a combination of GIS feature data and imagery data resources supporting common services managed by integrated data center operations.

Figure 5.24 shows the GIS enterprise data architecture. GIS enterprise architecture often includes both feature and imagery data within the data center, requiring effective data management and automation to maintain the variety of data sources. The GIS data administrator must manage a hybrid architecture containing a mix of resources stored on file systems and multiple database platforms. ArcGIS technology provides a variety of processing and replication functions for maintaining data resources in an optimum configuration.

Your optimum data configuration will depend on your business needs.

Best practice: Data architecture solutions are unique to each business operation. Review your user requirements and operational needs to establish the optimum data architecture for your business.


Storage architecture options

Storage technology has evolved over the past 20 years to improve data access and provide better management of available storage resources. Understanding the advantages of each technical solution will help you select the storage architecture that best supports your needs.


Advent of the storage area network

Figure 5.25 The storage area network evolved to satisfy adaptive storage management needs for data centers with many database servers.

Figure 5.25 shows the evolution of the storage area network. A storage area network provides an optimum data management solution for a data center with many database servers.

Local disk storage

Local disk storage is provided for desktop workstations, laptops, and mobile devices.

Best practice: Optimum workstation configurations today include two local disks for enhanced display performance and data protection.

Internal disk storage

Internal disk storage is provided in file servers following the same pattern used for desktop workstations.

Internal disk storage architecture started to cause problems for larger data centers, which found that their disk storage assets had become silos of dedicated storage in server platforms, some with too much disk capacity and others with too little. A more adaptive storage management solution was needed.

Direct attached storage (DAS)

DAS moves the primary storage volumes into a separate platform tier with data volumes that can be assigned to servers as required to satisfy data storage needs.

As the data centers grew, there was a demand for more ways to allocate storage volumes to the growing number of database server platforms.

Storage area networks (SAN)

SANs establish a network between the multiple database servers and the multiple storage arrays, providing adaptive connectivity for assigning storage volumes to the server platform tier as required to meet operational needs.

Best practice: SANs have become an optimum storage solution for large data center environments with many database servers.


Advent of network-attached storage

Figure 5.26 shows the evolution of network attached storage. Network attached storage provides an optimum data management solution for high-capacity file-based storage (high-availability file share appliances). Direct-attached storage provided the first business case in favor of network attached storage appliances.

Figure 5.26 Network attached storage was established as an optimum solution for high-availability file share environments.

Direct-attached storage (DAS) had disk storage volumes dedicated to a specific assigned server. Two clustered file servers were required to create a high-availability file share. This DAS file share pattern was expensive and difficult to manage.

Network-attached storage (NAS)

NAS provided a high-availability file share appliance which would simply connect to a standard Internet Protocol (IP) local area network switch (file share protocols and full high availability redundancy were included with the NAS appliance). The NAS appliance provides a simple plug-and-play solution for including a file share on a local area network.

As NAS technology started to become popular (late 1990s), most enterprise business solutions, including GIS, were moving their data to database management systems. The SAN solution was winning over the NAS for most large data center deployments.

Internet Small Computer Systems Interface (iSCSI)

iSCSI was developed for building SAN solutions from NAS appliance systems (iSCSI storage area network).

Fibre Channel over Ethernet (FCoE)

FCoE was developed to provide SCSI block protocol over standard Ethernet networks without the IP routing overhead.

Best practice: NAS provides an optimum storage architecture for enterprise GIS operations. Most enterprise GIS data centers include a mix of SAN (FC, FCoE, or iSCSI) and NAS solutions to satisfy their data management needs.

RAID (Redundant array of independent disks)

Enterprise GIS environments depend heavily on GIS data to support a variety of critical business processes. Data is one of the most valuable resources of a GIS, and protecting data is fundamental to supporting critical business operations.

The primary data protection line of defense is provided by the storage solutions. Most storage vendors have standardized on redundant array of independent disks (RAID) storage solutions for data protection. A brief overview of basic storage protection alternatives includes the following:

Just a Bunch of Disks (JBOD): A disk volume with no RAID protection is referred to as a just a bunch of disks (JBOD) configuration. This represents a configuration of disks with no protection and no performance optimization.

RAID 0: A disk volume in a RAID 0 configuration provides striping of data across several disks in the storage array. Striping supports parallel disk controller access to data across several disks reducing the time required to locate and transfer the requested data. Data is transferred to array cache once it is found on each disk. RAID 0 striping provides optimum data access performance with no data protection. One hundred percent of the disk volume is available for data storage.

RAID 1: A disk volume in a RAID 1 configuration provides mirror copies of the data on disk pairs within the array. If one disk in a pair fails, data can be accessed from the remaining disk copy. The failed disk can be replaced and data restored automatically from the mirror copy without bringing the storage array down for maintenance. RAID 1 provides optimum data protection with minimum performance gain. Available data storage is limited to 50 percent of the total disk volume, since a mirror disk copy is maintained for every data disk in the array.

RAID 3 and 4: A disk volume in a RAID 3 or RAID 4 configuration supports striping of data across all disks in the array except for one parity disk. A parity bit is calculated for each data stripe and stored on the parity disk. If one of the disks fails, the parity bit can be used to recalculate and restore the missing data. RAID 3 and RAID 4 provide good protection of the data and allow efficient use of the storage volume: every disk except the single parity disk is available for data storage.

There are some technical differences between RAID 3 and RAID 4, which, for our purposes, are beyond the scope of this discussion. Both of these storage configurations have potential performance disadvantages. The common parity disk must be accessed for each write, which can result in disk contention under heavy peak user loads. Performance may also suffer because of requirements to calculate and store the parity bit for each write. Write performance issues are normally resolved through array cache algorithms on most high-performance disk storage solutions.
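The parity recovery described above can be illustrated with a short sketch. It assumes simple XOR parity over equal-length blocks, which is the usual way single-parity RAID is explained; the function names and sample data are hypothetical and do not represent any vendor's implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_blocks + [parity])

# One stripe spread over three data disks, plus one dedicated parity disk.
stripe = [b"GIS-", b"DATA", b"0001"]
parity = xor_blocks(stripe)

# Simulate losing the second data disk and rebuilding its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
print(recovered)
```

Because every write changes the parity block, the sketch also shows why the single parity disk is touched on each write and can become a point of contention.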

The following RAID configurations are the most commonly used to support ArcSDE storage solutions. These solutions represent RAID combinations that best support data protection and performance goals.


Figure 5.27 RAID 1/0 provides optimum protection and optimizes query performance.

Figure 5.27 shows the RAID 1/0 storage disk layout. RAID 1/0 is a composite solution including RAID 0 striping and RAID 1 mirroring.

Best practice: High-activity database index tables and log files are best located on RAID 1/0 storage volumes.
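As a rough illustration of how striping and mirroring combine, the sketch below maps a logical block number to a mirrored disk pair. The layout is an assumption made for clarity (one block per stripe unit, disks numbered so that each pair is adjacent); real arrays use larger stripe units and vendor-specific layouts, and place_block is a hypothetical helper.

```python
def place_block(block, mirrored_pairs):
    """Map a logical block to its mirrored pair in a simplified RAID 1/0 layout.

    Assumed layout: blocks are striped round-robin across the pairs (RAID 0),
    and each block is written to both disks of its pair (RAID 1).
    """
    if mirrored_pairs < 1 or block < 0:
        raise ValueError("need at least one mirrored pair and a non-negative block")
    pair = block % mirrored_pairs    # which pair receives this stripe unit
    row = block // mirrored_pairs    # position of the stripe unit on that pair
    primary = 2 * pair               # disks 2k and 2k+1 form pair k
    return {"pair": pair, "disks": (primary, primary + 1), "row": row}

# Eight disks arranged as four mirrored pairs.
for block in range(6):
    print(block, place_block(block, mirrored_pairs=4))
```

Consecutive blocks land on different pairs, so reads can proceed in parallel, while each block still survives the loss of one disk in its pair.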


Figure 5.28 RAID 5 provides optimum protection and good performance.

Figure 5.28 shows the RAID 5 storage disk layout. RAID 5 stripes data and parity across all disks in the array, so no single dedicated parity disk is needed; the equivalent of one disk of capacity is used for parity.

Best practice: GIS feature data and imagery files can be located on striped RAID 5 disks with minimum performance impact.


Figure 5.29 RAID 6 provides better protection and better query performance.

Figure 5.29 shows the RAID 6 storage disk layout. RAID 6 stripes data and two independent sets of parity across all disks in the array; the equivalent of two disks of capacity is used for parity.

Best practice: Use RAID 6 to reduce the risk of data loss from two concurrent disk failures.
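The storage capacity trade-offs between the RAID levels discussed in this section can be compared with a small calculation. This is a minimal sketch that counts whole disks only and ignores hot spares, formatting overhead, and vendor-specific reserves; the level labels and the usable_fraction helper are hypothetical names chosen for the example.

```python
def usable_fraction(level, disks):
    """Fraction of raw disk capacity left for data at a given RAID level."""
    if level in ("JBOD", "RAID0"):
        return 1.0                       # no capacity spent on protection
    if level in ("RAID1", "RAID10"):
        return 0.5                       # every data disk has a mirror copy
    if level in ("RAID3", "RAID4", "RAID5"):
        if disks < 3:
            raise ValueError("single-parity RAID needs at least 3 disks")
        return (disks - 1) / disks       # one disk's worth of parity
    if level == "RAID6":
        if disks < 4:
            raise ValueError("RAID 6 needs at least 4 disks")
        return (disks - 2) / disks       # two disks' worth of parity
    raise ValueError(f"unknown RAID level: {level}")

# Compare usable capacity for an 8-disk volume of 2 TB drives.
for level in ("RAID0", "RAID10", "RAID5", "RAID6"):
    print(level, 8 * 2 * usable_fraction(level, 8), "TB usable")
```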

Will storage be the next performance bottleneck?

Technology is improving display performance and moving more data faster and more efficiently throughout the server, network, and storage infrastructure.

Warning: All technology advances point toward increased potential for disk contention.

The good news is that there are technical solutions available to resolve contention. It is also quite simple to monitor disk I/O performance and identify if disk contention is a problem. Available solutions to disk contention include:

Moving to solid state storage technology

Figure 5.29.1 Solid State Technology is replacing mechanical drives as a foundation for high performance operations and mobile deployment.
A Hard Disk Drive (HDD) works by way of a mechanical drive head that must physically move to access locations on a rapidly-spinning magnetic disk. Rotating disks must wait for spindle motors, heads, and arms to physically locate data sectors.

HDD impacts on display performance vary based on workflow data access patterns, the location of data on the disk, and the disk rotation speed. HDD disk rotation speed impacts the capacity of the drives and the cost. Higher capacity drives are available at the lower rotation speeds, delivering a significantly lower cost per GB of storage. The number of disks in a RAID storage volume also impacts performance (more disks enable higher parallel throughput performance).
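The effect of rotation speed and disk count can be estimated with simple arithmetic. On average the drive waits half a rotation for the requested sector, so average rotational latency is 0.5 x 60,000 / RPM milliseconds. The sketch below applies that formula; the average seek time and the assumption that throughput scales linearly with the number of disks are illustrative simplifications, not measured values.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the target sector: half of one full rotation."""
    return 0.5 * 60_000.0 / rpm

def estimated_random_iops(rpm, avg_seek_ms, disks=1):
    """Idealized random I/O rate, assuming perfect parallelism across disks."""
    per_disk = 1000.0 / (avg_seek_ms + avg_rotational_latency_ms(rpm))
    return per_disk * disks

# Assumed seek times are placeholders; check the drive data sheet for real values.
for rpm, seek_ms in ((7_200, 8.5), (10_000, 4.5), (15_000, 3.0)):
    print(rpm, "rpm:",
          round(avg_rotational_latency_ms(rpm), 2), "ms latency,",
          round(estimated_random_iops(rpm, seek_ms, disks=8)), "IOPS across 8 disks")
```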

A solid state drive (SSD), on the other hand, has no moving parts and can access any location on the drive with equal speed and precision. SSD read and write performance can be 100 times faster than HDD, with a much smaller form factor and much less power consumption. The Samsung Solid State Drive White Paper provides a helpful overview of the current state of SSD technology and what you need to know when selecting the right storage technology solution for your environment.

Flash memory technology

Solid state storage technology started out as flash memory. All Flash devices have certain basic properties in common:

There are two basic types of flash chips: NOR and NAND (named after the NOR and NAND logic gates). NOR Flash chips support random access with execute-in-place capability and are commonly used to run code (direct read only). NAND Flash chips can store approximately four times as much data as NOR for the same price, deliver much faster erase and write times, and are the chips of choice for Solid State Drive technology.

Solid State Drive technology

Figure 5.29.2 Solid State Drive technology advances deliver enterprise class storage in a competitive marketplace.
There are three classes of NAND SSD chips. SSD classes are defined by the number of electrical charges that are stored in each NAND cell.

The more bits a cell stores at one time, the more capacity fits in the same space, which reduces manufacturing costs and increases SSD capacity. Manek Dubash shares his review on MLC vs SLC: Which flash SSD is right for you?

SSD endurance (maximum erase cycles) varies based on technology class. Multi-Level Cell configurations are increasingly sensitive to electric charge deterioration (wear out).

Enhanced SSD controller capabilities have advanced MLC cell endurance through a variety of amelioration techniques creating an attractive enterprise (eMLC) chip configuration. Amelioration techniques include:

The eMLC SSD configuration significantly reduces the cost of storage by providing SLC class endurance (100,000 erase/write cycles) at the MLC class capacity (4 times SLC class). Hitachi shares their release of SLC NAND Flash and eMLC NAND Flash Enterprise-class SSDs.
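To see what an erase/write cycle rating means in practice, the following back-of-the-envelope sketch converts it into an approximate service life. It assumes ideal wear leveling across the whole drive; the write amplification factor, drive size, and daily write volume are hypothetical inputs, and only the 100,000 cycle figure comes from the text above.

```python
def estimated_life_years(capacity_gb, erase_cycles, daily_writes_gb,
                         write_amplification=1.0):
    """Rough SSD lifetime, assuming writes are spread evenly over all cells."""
    if daily_writes_gb <= 0 or write_amplification <= 0:
        raise ValueError("daily writes and write amplification must be positive")
    total_writable_gb = capacity_gb * erase_cycles
    physical_writes_per_day = daily_writes_gb * write_amplification
    return total_writable_gb / physical_writes_per_day / 365.0

# Hypothetical 400 GB drive rated at 100,000 cycles, absorbing 2 TB of
# host writes per day with an assumed write amplification of 2.
years = estimated_life_years(400, 100_000, 2_000, write_amplification=2.0)
print(round(years, 1), "years")
```

Lowering the cycle rating in this sketch shortens the estimate proportionally, which is why controller techniques that extend endurance matter for enterprise workloads.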

Hierarchical storage implementation

Figure 5.29.3 Hierarchical storage architecture introduces solid state storage as an integrated extension of existing storage solutions.
Most of the enterprise storage market today is supported by HDD technology. SSD technology will not replace spinning disks any time soon – so many of the popular enterprise storage solutions involve implementation of a hierarchical storage architecture. With hierarchical storage, only the working data set needs to be on SSD - and typically that’s about 5 to 15 percent of the total on-line data repository.

Most new laptops and mobile devices are now supported by SSD technology. Intel is starting to include SSD storage on PCI ports on the computer motherboard, enabling optimum bandwidth access to critical cached data sources. A variety of storage vendors are providing SSD based gateway products that connect to existing SAN and NAS storage, storing a cached copy of working business data on eMLC SSD components with historical resources retained on existing HDD storage. These hierarchical storage solutions deliver SSD class operational performance gains while continuing to leverage existing investment in HDD technology.
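The 5 to 15 percent working data set guideline gives a quick first estimate for sizing the SSD tier. The sketch below simply applies that range to a total repository size; the ssd_tier_range helper and the 40 TB example are hypothetical, and actual working set size should be confirmed from observed access patterns.

```python
def ssd_tier_range(total_tb, low_pct=5, high_pct=15):
    """Return (low, high) SSD tier sizes in TB for a given on-line repository."""
    if total_tb < 0:
        raise ValueError("repository size must be non-negative")
    return (total_tb * low_pct / 100, total_tb * high_pct / 100)

# Hypothetical 40 TB on-line repository kept on existing HDD storage.
low_tb, high_tb = ssd_tier_range(40)
print(f"Plan for roughly {low_tb} to {high_tb} TB of SSD working storage")
```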

Best practice:
1) Be aware of the potential for performance bottlenecks within your storage environment.
2) Monitor disk I/O performance to identify when disk contention is causing a performance delay.
3) Be aware that there are solutions to disk I/O performance problems, and take appropriate action to address performance issues when they occur.
4) Consider Solid State Disk storage solutions for future technology investments.
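Monitoring disk I/O usually comes down to comparing two samples of cumulative counters. The sketch below shows the arithmetic, assuming the operating system or storage array exposes counters for completed reads, completed writes, and device busy time. The DiskSample fields, the sample values, and the 80 percent busy threshold are hypothetical and should be adapted to the monitoring tool in use.

```python
from dataclasses import dataclass

@dataclass
class DiskSample:
    """Cumulative counters for one volume at one moment (hypothetical shape)."""
    timestamp_s: float   # when the sample was taken, in seconds
    reads: int           # read operations completed so far
    writes: int          # write operations completed so far
    busy_ms: int         # milliseconds the device has spent servicing I/O

def summarize(before, after, busy_threshold_pct=80.0):
    """Turn two counter samples into I/O rate and percent-busy figures."""
    elapsed_s = after.timestamp_s - before.timestamp_s
    if elapsed_s <= 0:
        raise ValueError("samples must be taken at increasing times")
    operations = (after.reads - before.reads) + (after.writes - before.writes)
    busy_pct = 100.0 * (after.busy_ms - before.busy_ms) / (elapsed_s * 1000.0)
    return {
        "iops": operations / elapsed_s,
        "busy_pct": busy_pct,
        # The threshold is an assumed rule of thumb, not a vendor limit.
        "contention_suspected": busy_pct >= busy_threshold_pct,
    }

# Two made-up samples taken ten seconds apart during a peak period.
first = DiskSample(0.0, 10_000, 4_000, 120_000)
second = DiskSample(10.0, 13_000, 5_000, 129_000)
report = summarize(first, second)
print(report)
```

A volume that stays near 100 percent busy during peak periods while display response times climb is a reasonable candidate for the solutions listed above.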


Ways to move GIS data

GIS data can be moved using an extract and load process or by forms of replication. The optimum method for moving your data will be determined by the amount of data to be moved and the associated business processes.


Traditional tape backup/disk copy

Figure 5.30 Disk and tape backup remain the most popular ways to move large volumes of GIS data.

Figure 5.30 shows ways to move large datasets. Large volumes of data are best moved on disk, DVD, or tape back-up.

Moving large volumes of data across shared network resources is very expensive, and impacts performance for all users on that network segment. Network bandwidth is a valuable and limited resource. If you conserve network bandwidth for primary display and query tasks, you will have more capacity available for better display performance.
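A quick estimate of network transfer time helps decide when physical media is the better choice. The sketch below uses decimal units (1 GB = 8,000 megabits) and an assumed efficiency factor to account for protocol overhead and competing traffic; the dataset size, link speed, and efficiency values are illustrative only.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.7):
    """Hours needed to move size_gb over a link, at a given usable share."""
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    megabits = size_gb * 8_000                      # decimal GB to megabits
    seconds = megabits / (link_mbps * efficiency)   # usable throughput only
    return seconds / 3600.0

# Hypothetical 2 TB imagery collection over three common link speeds.
for mbps in (10, 100, 1_000):
    print(f"{mbps} Mbps: {transfer_hours(2_000, mbps):.1f} hours")
```

During that time the transfer competes with display and query traffic on the same network segment, which is the cost the paragraph above describes.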


Database replication

Figure 5.31 Database replication is a good solution when you are keeping a replicated copy of all the database transactions.

Figure 5.31 shows a database replication architecture. Commercial database replication works well for maintaining a back-up failover database.

Best practice: Use database replication when you want to create and maintain a complete copy of the database environment. Database vendors provide optimum tools for data protection, back-up, and recovery operations.
Warning: ArcSDE geodatabase replication should not be used for back-up and recovery operations.


Disk-level replication

Figure 5.32 Storage-level replication provides back-up to a separate storage location.

Figure 5.32 shows a storage replication architecture. Storage disk-level replication provides the optimum solution for data center back-up and recovery operations.

Storage vendors typically provide incremental snapshot back-ups to local and remote data center locations. Many of these solutions include proven recovery tools and are widely in use. Back-up from storage volumes avoids database complexity issues because data is replicated at the disk storage block level.

Best practice: Use storage disk replication when you want to create and maintain a complete copy of the data storage environment. Storage vendors provide optimum tools for data protection, back-up, and recovery operations.


Protect your GIS data resources

Figure 5.33 Snapshot backup can protect your data investment.

Data protection at the disk level minimizes the need for system recovery in the event of a single disk failure but will not protect against a variety of other data failure scenarios. It is always important to keep a current backup copy of critical data resources, and maintain a recent copy at a safe location away from the primary site. Figure 5.33 highlights data backup strategies available to protect your business operations. It is important to maintain a snapshot back-up or copy of your data. A large percentage of data loss is caused by human error. The best way to protect your data is to maintain a reliable periodic point-in-time back-up strategy.

The type of backup system you choose for your business will depend on your business needs. For simple low priority single use environments, you can create a periodic point-in-time backup on a local disk or tape drive and maintain a recent off-site copy of your data for business recovery. For larger enterprise operations, system availability requirements may drive requirements for failover to backup Data Centers when the primary site fails. Your business needs will drive the level of protection you need.

Data backups provide the last line of defense for protecting our data investments. Careful planning and attention to storage backup procedures are important factors to a successful backup strategy. Data loss can result from many types of situations, with some of the most probable situations being administrative or user error.

Host Tape Backup: Traditional server backup solutions use lower-cost tape storage for backup. Data must be converted to a tape storage format and stored on a linear tape medium. Backups can be a long, drawn-out process that takes considerable server processing resources (typically consuming a CPU during the backup process) and requires special data management for operational environments.

For database environments, point-in-time backups are required to maintain database continuity. Database software provides for online backup requirements by enabling a procedural snapshot of the database. A copy of the protected snapshot data is retained in a snapshot table when changes are made to the database, supporting point-in-time backup of the database and potential database recovery back to the time of the snapshot.
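The snapshot table behavior described above is a form of copy-on-write. The following is a minimal in-memory sketch of the idea, not a model of any particular DBMS: the CopyOnWriteStore class, its method names, and the parcel rows are hypothetical.

```python
_MISSING = object()  # marks keys that did not exist when the snapshot began

class CopyOnWriteStore:
    """Toy key-value store that preserves a point-in-time view during writes."""

    def __init__(self, rows):
        self.live = dict(rows)
        self.snapshot_table = None   # None means no snapshot is active

    def begin_snapshot(self):
        """Start a snapshot; nothing is copied until a row is changed."""
        self.snapshot_table = {}

    def write(self, key, value):
        """Save the original value once, then apply the change to live data."""
        if self.snapshot_table is not None and key not in self.snapshot_table:
            self.snapshot_table[key] = self.live.get(key, _MISSING)
        self.live[key] = value

    def snapshot_view(self):
        """Rebuild the data as it was when the snapshot began."""
        if self.snapshot_table is None:
            raise RuntimeError("no snapshot is active")
        view = dict(self.live)
        for key, original in self.snapshot_table.items():
            if original is _MISSING:
                view.pop(key, None)   # row was added after the snapshot
            else:
                view[key] = original  # row was changed after the snapshot
        return view

store = CopyOnWriteStore({"parcel_1": "owner A"})
store.begin_snapshot()
store.write("parcel_1", "owner B")   # edit made while the backup is running
store.write("parcel_2", "owner C")   # new row added during the backup
print(store.snapshot_view())         # what the backup process would read
print(store.live)                    # what online users see
```

Only rows that change during the backup are copied, which is why the backup can run online without holding a full second copy of the database.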

Host processors can be used to support backup operations during off-peak hours. If backups are required during peak-use periods, backups can impact server performance.

Network Client Tape Backup: The traditional online backup can often be supported over the LAN with the primary batch backup process running on a separate client platform. DBMS snapshots may still be used to support point-in-time backups for online database environments. Client backup processes can contribute to potential network performance bottlenecks between the server and the client machine because of the high data transfer rates during the backup process.

Storage Area Network Client Tape Backup: Some backup solutions support direct disk storage access without impacting the host DBMS server environment. Storage backup is performed over the SAN or through separate storage network access to the disk array, with the batch process running on a separate client platform. A disk-level storage array snapshot is used to support point-in-time backups for online database environments. Host platform processing loads and LAN performance bottlenecks can be avoided with disk-level backup solutions.

Disk Copy Backup: The size of databases has increased dramatically in recent years, growing from tens of gigabytes to hundreds of gigabytes and, in many cases, terabytes of data. Recovery of large databases from tape backups is very slow; large spatial database environments can take days to recover. At the same time, the cost of disk storage has decreased dramatically, making disk copy solutions for large database environments price-competitive with tape storage solutions. A copy of the database on local disk, or a copy of these disks at a remote recovery site, can support immediate restart of the DBMS following a storage failure by simply restarting the DBMS with the backup disk copy.

There are several names for disk backup strategies (remote backup, disaster recovery plan, business continuance plan, continuation of business operations plan, etc). The important thing is that you consider your business needs, evaluate risks associated with loss of data resources, and establish a formal plan for business recovery in the event of data loss.

The question is not if, but when. Most people will, at some time, experience loss of valuable data and business resources.

Data Management Overview

Support for distributed database solutions has traditionally introduced high-risk operations, with potential for data corruption and use of stale data sources in GIS operations. There are organizations that support successful distributed solutions. Their success is based on careful planning and detailed attention to their administrative processes that support the distributed data sites. More successful GIS implementations support central consolidated database environments with effective remote user performance and support. Future distributed database management solutions may significantly reduce the risk of supporting distributed environments. Whether centralized or distributed, the success of enterprise GIS solutions will depend heavily on the administrative team that keeps the system operational and provides an architecture solution that supports user access needs.


The next chapter will discuss Network Communications, providing some insight on how to build GIS solutions that support remote user performance and Enterprise GIS scalability.


System Design Strategies 26th edition - An Esri ® Technical Reference Document • 2009 (final PDF release)

Navigation
Need Help
Toolbox
Share This Page