GIS Data Administration 28th Edition (Fall 2010)


Data management is a primary consideration when developing enterprise GIS architectures. Enterprise GIS normally benefits from efforts to consolidate agency data resources. The reasons for supporting data consolidation include improving user access to data resources, providing better data protection, and enhancing the quality of the data. Consolidating IT support resources also reduces hardware cost and the overall cost of system administration.

The simplest and most cost-effective way to manage data resources is to keep one copy of the data in a  central data repository and provide user access for data maintenance and  operational GIS query and analysis needs. This is not always practical, and many business operations require that organizations maintain  distributed copies of the data. Significant compromises may have to be made to support a distributed data architecture.

This section provides an overview of GIS data management technology patterns. Several basic data management tasks are identified along with the current state of technology to support them. These tasks include ways to manage, serve, move, store, protect, and back up spatial data.

Management of GIS data resources differs slightly between spatial vector data and raster imagery. Imagery is fully integrated into GIS with the ArcGIS 10 release. Vector data is managed as features within the geodatabase, and imagery is managed as a Mosaic Dataset. Both are managed within the ArcGIS Desktop ArcCatalog application.

GIS Spatial Data Architecture Patterns
Figure 4-1 provides an overview of the GIS spatial data architecture patterns. GIS spatial data include points, polygons, lines and their associated attributes. Data is stored in geodatabase feature tables, where each row in the table represents a spatial feature and its associated  attributes. Complex features and the parcel fabric introduced with ArcGIS 10 are special extensions of these same features.



Lidar point elevation and terrain data are currently stored and managed in a  geodatabase. Future plans are to manage these data as raster datasets with the ArcCatalog imagery management tools.

GIS Spatial Data Management
The ArcSDE Geodatabase was designed to manage GIS spatial data. The ArcSDE schema provides a multi-versioned data maintenance environment which  can include dependencies and relationships to maintain quality and  integrity of features entered into the geodatabase. Shape files, CAD files, and location/elevation data are a few examples of data that can  be managed in an ArcSDE Geodatabase.

Versioned ArcSDE Geodatabase servers provide the central GIS data repository for many  enterprise GIS implementations. These maintenance databases can maintain thousands of concurrent edit sessions (work in progress) in  many different states of completion.

GIS Spatial Data Distribution Database
Many large organizations have used geodatabase replication to establish and maintain a separate read-only distribution or publishing database that contains only the published data layers in a simple feature format. In many enterprise environments, most workflows require read-only access to the spatial data. The simple feature structure improves display performance and server capacity for optimum access by enterprise business users.

ArcGIS Server map caching services make it feasible to create and maintain an optimized map cache of the more  static vector basemap layers. This pyramid map cache can be accessed as a preprocessed basemap layer, with the more dynamic operational layers  served from the distribution database. GIS client applications will combine the operational business layers with the basemap in the map  display. Preprocessed map cache provides the fastest access and the most scalable GIS data source.

ArcGIS Server provides full, partial, and on-demand caching. ArcGIS 10 provides a compact map cache format that reduces storage volume by up to 90 percent and improves disk access performance. ArcGIS 10 also supports mixed-mode cache tile formats, which allow most of the imagery to be stored in highly compressed JPEG format while providing transparent PNG24 boundary tiles. Mixed-mode cache tile formats are important when supporting incremental updates to the map cache repository.
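The mixed-mode selection rule can be sketched in a few lines of Python. This is an illustrative model only; the per-tile sizes below are assumed numbers for the sketch, not Esri figures:

```python
def tile_format(coverage_fraction):
    """Toy mixed-mode rule: a tile fully covered by imagery is stored as
    compact JPEG; a partially covered boundary tile is stored as PNG24 so
    the uncovered area remains transparent."""
    return "JPEG" if coverage_fraction >= 1.0 else "PNG24"

def cache_size_kb(tile_coverages, jpeg_kb=20, png_kb=60):
    """Rough cache-size estimate under assumed (hypothetical) per-tile
    sizes, showing why pushing most tiles into JPEG keeps the cache small."""
    return sum(jpeg_kb if tile_format(c) == "JPEG" else png_kb
               for c in tile_coverages)

# Two interior tiles and one boundary tile:
print(cache_size_kb([1.0, 1.0, 0.5]))  # 20 + 20 + 60 = 100
```

Because only the boundary tiles pay the PNG24 storage penalty, an incremental update can regenerate a handful of transparent edge tiles without touching the compact JPEG interior.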

History Geodatabase
Historical records of GIS features are becoming popular as temporal GIS workflow analysis grows more important as a business information resource. ArcSDE provides a history geodatabase for maintaining time-stamped records of geodatabase state changes over time. As feature changes are made to the maintenance database, a copy of the feature is replicated to the history database instance as a time-stamped record of that state. Temporal views of the history dataset can be accessed based on the time stamp history.
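The time-stamped history behavior can be sketched as a toy Python model (illustrative only; this is not how ArcSDE physically stores history, and the feature ids and attributes are invented for the example):

```python
class HistoryTable:
    """Toy model of a history geodatabase: posting an edit archives a
    time-stamped copy of the feature's state."""

    def __init__(self):
        self._states = {}  # feature_id -> list of (timestamp, attributes)

    def archive(self, feature_id, timestamp, attributes):
        """Record the feature state at the moment an edit was posted."""
        self._states.setdefault(feature_id, []).append((timestamp, dict(attributes)))

    def as_of(self, feature_id, timestamp):
        """Temporal view: the most recent archived state at or before
        `timestamp`, or None if the feature did not yet exist."""
        best = None
        for ts, attrs in self._states.get(feature_id, []):
            if ts <= timestamp and (best is None or ts > best[0]):
                best = (ts, attrs)
        return best[1] if best else None

h = HistoryTable()
h.archive(101, 1, {"zoning": "agricultural"})
h.archive(101, 5, {"zoning": "residential"})
print(h.as_of(101, 3))  # {'zoning': 'agricultural'}
```

A temporal query simply picks the latest archived state whose time stamp does not exceed the requested moment, which is the essence of the "temporal views" described above.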

Ways to Manage and Access Spatial Data
The initial ArcGIS release introduced the ArcSDE geodatabase, which provides a way to manage long-transaction edit sessions within a single database instance. ArcSDE supports long transactions using versions (different views) of the database. A geodatabase can support thousands of concurrent versions of the data within a single database instance. The default version represents the real world, and the other named versions are proposed changes and database updates in work.

Figure 4-2 shows a typical long transaction workflow life cycle. The workflow represents design and construction of a typical housing subdivision. Several design alternatives might initially be represented as separate named versions in the database to support planning for a new  subdivision. One of these designs (versions) is approved to support the construction phase. After the construction phase is complete, the selected design (version) is modified to represent the as-built  environment. Once development is completed, the final design version will be reconciled with the geodatabase and posted to the default  version to reflect the new subdivision changes.



The simplest way to introduce the versioning concept in the geodatabase is  by using some logical flow diagrams. Figure 4-3 demonstrates the explicit state model represented in the geodatabase. The default version lineage is represented in the center of the diagram, and a new default  version state is added each time edits are posted to the default view. Each edit post represents a state change in the default view (accepted changes to the real-world view). There can be thousands of database changes (versions) at a time. As changes are completed, these versions are posted to the default lineage.





The new version on the top of the diagram shows the life cycle of a long  transaction. The transaction begins as changes from "state 1" of the default lineage. Maintenance updates reflected in that version are represented by new states in the edit session (1a, 1b, and 1c). During the edit session, the default version accepts new changes from other  completed versions. The new version active edit session is not aware of the posted changes to the default lineage (2, 3, 4, and 5) since it is  referenced from default state 1. Once the new version is complete, it must be reconciled with the default lineage. The reconcile process compares the changes in the new version (1a, 1b, and 1c) with changes in  the default lineage (2, 3, 4, and 5) to make sure there are no edit  conflicts. If the reconcile process identifies conflicts, these conflicts must be resolved before the new version can be posted to the  default lineage. Once all conflicts are resolved, the new version is posted to the default lineage forming state 6.
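The reconcile-and-post cycle described above can be sketched as a toy model. This is illustrative Python only, not the ArcSDE implementation; feature edits are simplified to id/value pairs:

```python
def reconcile(version_edits, default_edits):
    """Toy reconcile: compare the edit session's changes against changes
    posted to default since the version was created.  A feature modified
    in both, with different resulting values, is a conflict that must be
    resolved before posting."""
    return sorted(
        fid for fid in version_edits
        if fid in default_edits and default_edits[fid] != version_edits[fid]
    )

def post(default_state, default_edits, version_edits):
    """Posting applies the version's edits on top of the current default
    lineage, forming a new default state (valid only once reconcile
    reports no conflicts)."""
    new_state = dict(default_state)
    new_state.update(default_edits)   # changes already posted to default
    new_state.update(version_edits)   # the reconciled version's changes
    return new_state

# No overlap between the two edit sets, so no conflicts:
print(reconcile({"parcel_A": "moved"}, {"parcel_B": "split"}))  # []
```

When the same feature was changed in both lineages, `reconcile` returns it as a conflict, mirroring the manual conflict-resolution step required before a version can be posted.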



Figure 4-4 shows a typical workflow history of the default lineage. Named versions (t1, t4, and t7) represent edit transactions in work that have  not been posted back to the default lineage. The parent states of these versions (1, 4, and 7) are locked in the default lineage to support the  long edit sessions that have not been posted. The default lineage includes several states (2, 3, 5, and 6) that were created by posting  completed changes.

Figure 4-5 demonstrates a geodatabase compress. Very long default lineages (thousands of states) can impact database performance. The geodatabase compress function consolidates all default changes into the named version parent states,  thus decreasing the length of the default lineage and improving database  performance.
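The effect of a compress can be modeled in a few lines. In this sketch (an assumption-laden simplification, not the actual geodatabase algorithm), states pinned by open versions survive and every other state is folded into its predecessor:

```python
def compress(lineage, locked_states):
    """Toy geodatabase compress: consecutive default states that are not
    locked by an outstanding version are folded together, shortening the
    default lineage while preserving the net changes."""
    compressed = []
    for state_id, changes in lineage:
        if compressed and state_id not in locked_states:
            # Fold this state's changes into the previous surviving state.
            compressed[-1][1].update(changes)
        else:
            compressed.append([state_id, dict(changes)])
    return compressed

lineage = [(1, {"a": 1}), (2, {"b": 2}), (3, {"c": 3}),
           (4, {"d": 4}), (5, {"e": 5}), (6, {"f": 6})]
# States 1 and 4 are parents of open versions, so they must survive:
print([s for s, _ in compress(lineage, {1, 4})])  # [1, 4]
```

The lineage shrinks from six states to two, yet the union of changes is unchanged, which is why a routine compress improves query performance without altering the data.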



Now that the geodatabase versioning concept is understood, it is helpful to recognize how it is physically implemented within the database table structure. When a feature table within the geodatabase is versioned, two new tables are created to track changes to the base feature table. An Adds Table is created to track rows added to the base feature table, and a Deletes Table is created to record rows deleted from the Base Table. Each row in the Adds and Deletes tables represents a change state within the geodatabase. As changes are posted to the default version, these changes are represented by pointers in the Adds and Deletes tables. For a versioned geodatabase, the real-world view (default version) is represented by the Base Table plus the Adds and Deletes tables included in the default lineage (the Base Table alone does not represent default). All outstanding versions must be reconciled and posted to compress all default changes back to the Base Table (zero state). This is not likely to occur for a working maintenance database in a real-world environment.
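The "base plus adds minus deletes" view can be expressed directly. A minimal sketch, assuming row ids keying each table and treating an update as a delete of the old row plus an add of the new one:

```python
def default_view(base_table, adds_table, deletes_table):
    """Toy versioned view: DEFAULT = Base Table rows, minus the row ids
    listed in the Deletes table, plus the rows in the Adds table."""
    view = {rid: row for rid, row in base_table.items()
            if rid not in deletes_table}
    view.update(adds_table)
    return view

base = {1: "road", 2: "parcel"}
adds = {2: "parcel (updated)", 3: "hydrant"}   # update of 2, insert of 3
deletes = {2}                                   # old state of row 2 removed
print(default_view(base, adds, deletes))
```

This is why the Base Table by itself never represents the default version: the current view only emerges after the delta tables in the default lineage are applied on top of it.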

ArcSDE Geodatabase
The ArcGIS technology includes a spatial database engine (ArcSDE) for managing and sharing GIS data. Figure 4-6 provides an overview of the ArcSDE components.

Every Esri software product includes an ArcSDE communications client. The ArcSDE schema includes relationships and dependencies used to manage geodatabase versioning and replication functionality. The ArcSDE schema also includes the geodatabase license code stored in host DBMS tables. ArcSDE also includes an executable that translates communications between ArcGIS ArcObjects and the supported DBMS. The ArcSDE executable is included in the ArcGIS ArcObjects DBMS direct connect application program interface (API), and is also available for install on the DBMS server or middle server tier as a separate application executable (GSRVR).

Geodatabase Evolution. ArcSDE has evolved from an initial binary schema and spatial storage types to the current XML schema with SQL spatial storage types. Figure 4-7 shows this evolution, which improved spatial data access, enhanced performance and scalability, and expanded the collection of supported spatial storage data types.



The GIS spatial and attribute data are stored in relational database  tables. The ArcSDE and User schema defines the geodatabase table structure, relationships, and dependencies. Figure 4-8 provides a representation of the Base Table, Adds Table, and Deletes Table in a  versioned geodatabase. 

Ways to Move Spatial Data
Many enterprise GIS solutions require continued maintenance of distributed  copies of the GIS data resources, typically replicated from a central  GIS data repository or enterprise database environment. Organizations with a single enterprise database solution still have a need to protect  data resources in the event of an emergency such as fire, flood,  accidents, or other natural disasters. Many organizations have recently reviewed their business practices and updated their plans for business  continuance in the event of a major loss of data resources. The tragic events of September 11, 2001, demonstrated the value of such plans and  increased interest and awareness of the need for this type of  protection.

This section reviews the various ways organizations move spatial data. Traditional methods copy data on tape or disk and physically deliver this data to the remote site through  standard transportation modes. Once at the remote site, data is reinstalled on the remote server environment. Technology has evolved to provide more efficient alternatives for maintaining distributed data  sources. Understanding the available options and risks involved in moving data is important in defining optimum enterprise GIS  architecture.

Traditional Data Transfer Methods
Figure 4-9 identifies traditional methods for moving a copy of data to a remote location.

Traditional methods include backup and recovery of data using standard tape or disk  transfer media. Moving data using these methods is commonly called "sneaker net." These methods provide a way to transfer data without the support of a physical network.

Tape Backup: Tape backup solutions can be used to move data to a separate server  environment. Tape transfers are normally very slow. The reduced cost of disk storage has made disk copy a much more feasible option.

Disk Copy: A replicated copy of the database on disk storage can support  rapid restore at a separate site. The database can be restarted with the new data copy and online within a short recovery period.



Database Replication
Customers have experienced a variety of technical challenges when configuring  DBMS spatial data replication solutions. ArcSDE data model modifications may be required to support DBMS replication solutions. Edit loads will be applied to both server environments, contributing to potential  performance or server sizing impacts. Data changes must be transmitted over network connections between the two servers, causing potential  communication bottlenecks. These challenges must be overcome to support a successful DBMS replication solution. Customers have indicated that DBMS replication solutions can work but require a  considerable amount of patience and implementation risk. Acceptable solutions are available through some DBMS vendors to support replication  to a read-only backup database server. Dual master server configurations significantly increase the complexity of an already  complex replication solution. Figure 4-10 presents the different ways to move spatial data using database replication.



Synchronous Replication. Real-time replication requires commitment of data  transfer to the replicated server before releasing the client  application on the primary server. Edit operations with this configuration would normally result in performance delays because of the  typical heavy volume of spatial data transfers and the required client  interaction times. High-bandwidth fiber connectivity (1000 Mbps bandwidth) is recommended between the primary server and the replicated  backup server to minimize performance delays.

Asynchronous Replication. Near real-time database replication strategies decouple  the primary server from the data transfer transaction to the secondary  server environment. Asynchronous replication can be supported over WAN connections, since the slow transmission times are isolated from primary  server performance. Data transfers (updates) can be delayed to off-peak periods if WAN bandwidth limitations dictate, supporting periodic  updates of the secondary server environment at a frequency supporting  operational requirements.
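The decoupling that makes asynchronous replication WAN-friendly can be sketched as a queue between the two servers. This is a conceptual model, not any vendor's replication engine; the keys and values are invented for the example:

```python
from collections import deque

class AsyncReplica:
    """Toy asynchronous replication: the primary commits locally and
    queues the change; the secondary is refreshed later (e.g. off-peak),
    so slow WAN transfer never blocks the primary server."""

    def __init__(self):
        self.primary, self.secondary = {}, {}
        self._pending = deque()

    def commit(self, key, value):
        self.primary[key] = value           # primary returns immediately
        self._pending.append((key, value))  # WAN transfer is deferred

    def synchronize(self):
        """Drain the queue, e.g. on a schedule during off-peak hours."""
        while self._pending:
            key, value = self._pending.popleft()
            self.secondary[key] = value

rep = AsyncReplica()
rep.commit("road_1", "edited")
print(rep.secondary)  # {} -- the replica lags until synchronized
```

The trade-off is visible in the model: between synchronizations the secondary is stale, which is acceptable for a warm standby but rules this pattern out where the two copies must never diverge.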

Disk-Level Replication
Disk-level replication is a well-established technology, supporting global  replication of critical data for many types of industry solutions. Spatial data is stored on disk sectors very similar to any other data storage and, as such, does not require special attention beyond what  might be required for other data types. Disk volume configurations (data location on disk and what volumes are transferred to the remote site)  may be critical to ensure database integrity. Mirror copies are refreshed based on point-in-time snapshot functions supported by the  storage vendor solution. Disk-level replication provides transfer of block-level data changes on disk to a mirror disk volume  located at a remote location. Transfer can be supported with active online transactions with minimum impact on DBMS server performance  capacity. Secondary DBMS applications must be restarted to refresh the DBMS cache and processing environment to the point in time of the  replicated disk volume. Figure 4-11 presents the different ways to move spatial data using disk level replication.
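Block-level change shipping can be modeled with a dirty-block set. A minimal sketch under the stated assumptions (fixed-size blocks, one mirror volume); real storage arrays track changes in controller firmware:

```python
class BlockReplicator:
    """Toy disk-level replication: track which blocks changed since the
    last snapshot and ship only those blocks to the mirror volume."""

    def __init__(self, blocks):
        self.primary = list(blocks)
        self.mirror = list(blocks)
        self._dirty = set()

    def write(self, index, data):
        self.primary[index] = data
        self._dirty.add(index)        # change recorded at the block level

    def snapshot_sync(self):
        """Point-in-time refresh of the mirror; returns blocks shipped."""
        shipped = len(self._dirty)
        for i in self._dirty:
            self.mirror[i] = self.primary[i]
        self._dirty.clear()
        return shipped                # only changed blocks cross the wire
```

Because only dirty blocks are transferred, a large mostly-static spatial dataset replicates cheaply; the cost scales with the edit volume, not the database size. The model also shows why the secondary DBMS must restart after a refresh: the mirror jumps to a new point in time all at once.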



Synchronous Replication. Real-time replication requires commitment of data transfer to the replicated storage array before releasing the DBMS application on the primary server. High-bandwidth fiber connectivity (1000 Mbps bandwidth) is recommended between the primary server and the replicated backup server to avoid performance delays.

Asynchronous Replication: Near real-time disk-level replication strategies  decouple the primary disk array from the commit transaction of changes  to the secondary storage array environment. Asynchronous replication can be supported over WAN connections, since the slow transmission times  are isolated from primary DBMS server performance. Disk block changes can be stored and data transfers delayed to off-peak periods if WAN  bandwidth limitations dictate, supporting periodic updates of the  secondary disk storage volumes to meet operational requirements.

ArcGIS Geodatabase Transition
Standard backup strategies normally cannot move subsets of a single database. Data must be extracted from the primary database and imported into the remote database to support the data transfer. Database transition can be supported using standard ArcGIS export/import functions. These tools can be used as a method of establishing and maintaining a copy of the database at a separate location. Figure 4-12 identifies ways to move spatial data using ArcGIS data transition functions.



ArcSDE Admin Commands: Batch processes can be used with ArcSDE admin commands to support export and import of an ArcSDE database. Moving data using these commands is most practical when completely replacing the data layers. These commands are not optimum solutions when transferring data to a complex ArcSDE geodatabase environment.

ArcCatalog/ArcTools Commands: ArcCatalog supports migration of data between ArcSDE geodatabase environments, extracts to a personal geodatabase, and imports from a personal geodatabase to an ArcSDE environment.



Distributed Geodatabase
ArcSDE manages the versioning schema of the geodatabase and supports client application access to the appropriate views of the geodatabase. ArcSDE also supports export and import of data from and to the appropriate database tables and maintains the geodatabase schema defining relationships and dependencies between the various tables.

Geodatabase Single-Generation Replication
The ArcGIS 8.3 release introduced a disconnected editing solution. This solution provides a registered geodatabase version extract to a personal geodatabase or separate  database instance for disconnected editing purposes. The version adds/deletes values are collected by the disconnected editor and, on  reconnecting to the parent server, can be uploaded to the central ArcSDE  database as a version update.

Figure 4-13 presents an overview of the ArcGIS 8.3 disconnected editing with checkout to a  personal geodatabase (PGD). The ArcGIS 8.3 release is restricted to a single checkout/check-in transaction for each client edit session.

Figure 4-14 presents an overview of the ArcGIS 8.3 disconnected editing with checkout to a separate ArcSDE geodatabase. The ArcGIS 8.3 release is restricted to a single checkout/check-in transaction for each child ArcSDE database. The child ArcSDE database can support multiple disconnected or local version edit sessions during the checkout period. All child versions must be reconciled before check-in with the parent ArcSDE database (any outstanding child versions will be lost during the child ArcSDE database check-in process).
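The single-generation restriction can be captured in a toy model (illustrative Python, not the ArcGIS 8.3 implementation): one outstanding checkout, one check-in that consumes it, and a new extract required for further edits.

```python
class ParentGeodatabase:
    """Toy single-generation (ArcGIS 8.3 style) disconnected editing:
    one checkout, offline edits, then a single check-in."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self._checked_out = False

    def checkout(self):
        """Extract the data to a child/personal geodatabase."""
        if self._checked_out:
            raise RuntimeError("only one outstanding checkout is allowed")
        self._checked_out = True
        return dict(self.rows)

    def checkin(self, edited_rows):
        """Apply the disconnected edits; the checkout is consumed, so a
        new extract is needed before editing again."""
        if not self._checked_out:
            raise RuntimeError("nothing is checked out")
        self.rows.update(edited_rows)
        self._checked_out = False
```

The later multi-generation replication described below removes exactly this limitation: a child can exchange updates repeatedly without a fresh checkout.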

Geodatabase One-way Multi-generation Replication
The ArcGIS 8.3 database checkout functions provided with disconnected editing can be used to support peer-to-peer database  refresh. Figure 4-15 shows a peer-to-peer database checkout, where ArcSDE disconnected editing functionality can be used to periodically  refresh specific feature tables of the geodatabase to support a separate  instance of the geodatabase environment. This functionality can be used to support a separate distribution view-only geodatabase that can be  configured to support a nonversioned copy of the default version.

The ArcGIS 9.2 software introduced support for incremental updates between ArcSDE geodatabase environments.



Geodatabase Two-way Multi-generation Replication
The ArcGIS disconnected editing functionality was expanded with the ArcGIS 9 releases to support loosely coupled ArcSDE distributed database environments. Figure 4-16 presents an overview of the loosely coupled ArcSDE distributed database concept.

Multi-generation replication supports a single ArcSDE geodatabase distributed over multiple platform environments. The child checkout versions of the parent database support an unlimited number of update transactions without losing local version edits or requiring a new checkout. Updates are passed between parent and child database environments through simple datagrams that can be transmitted over standard WAN communications. This geodatabase architecture supports distributed database environments over multiple sites connected by limited-bandwidth communications (only the reconciled changes are transmitted between sites to support database synchronization).

Figure 4-17 provides an overview of common ArcGIS Server geodatabase use case scenarios.

Regional offices can be supported by two-way multi-generation replication synchronizing with the central Data Center corporate SDE geodatabase server. The central server maintains the land base layers, while the remote offices update the operational layers.

Mobile users can work in the field, receiving project updates and synchronizing with the central Enterprise geodatabase.

Federated hierarchical data exchange provides incremental updates from local, to state, to federal geodatabase levels, filtered as appropriate for each level of access.

Distribution (publication) geodatabase environments can be incrementally updated from the maintenance (production) database for read-only access by large communities of users.

Geodatabase replication establishes a framework for a broad variety of distributed geodatabase operations.

Distributed Data Architecture Strategies
Geodatabase replication is becoming more important as enterprise organizations expand and manage their data across multiple data centers. Figure 4-18 shows an Enterprise Data Center supporting a variety of remote-site clients (stand-alone ArcGIS Desktop, CAD clients, Citrix terminal clients, and mobile GIS viewers). ArcGIS Server can be used to replicate Enterprise GIS data resources from the maintenance database to remote publishing databases maintained at a separate Data Center or in a published cloud computing environment.

Figure 4-19 shows how replication services are leveraged to support a  Federated architecture. Regional Data Centers can host maintenance databases from multiple municipal organizations reducing overall  administrative costs for the region. Each municipality can publish to their database of record (distribution database) for sharing with the  community. Subsets (filtered versions) of the different Municipal publication databases can be integrated at a National or Global level  and then published again as a National dataset. Any of the server levels can be hosted by private Data Centers, private cloud hosting  providers, or the final copy published on a public cloud hosting  facility. All of this is made possible with ArcGIS Server Geodatabase replication services.



GIS Raster Imagery Data Architecture
Figure 4-20 provides an overview of the GIS image data architecture patterns. GIS image data include aerial photography and satellite imagery delivered in a variety of storage formats (TIFF, IMG, MrSID, JPEG 2000, etc.). Data is stored in its delivered source format. This is important, since every time you manipulate or change the imagery format you lose quality. Imagery can be quite large (often measured in hundreds of gigabytes, terabytes, or even petabytes of data). There is a real advantage in moving the data "as is" directly to your storage environment.



Lidar point elevation and terrain data are currently stored in a geodatabase  and managed with the vector data. Future plans are to manage these data as raster datasets with ArcCatalog imagery management tools.

GIS Raster Data Management
The Mosaic Dataset was designed to manage GIS raster data. A Mosaic Dataset is created using ArcGIS Desktop ArcCatalog and provides  on-the-fly processing of the raw imagery data sources. The Mosaic Dataset is created within a host geodatabase.

GIS Raster Data Access
Imagery can be accessed from ArcGIS Desktop or through ArcGIS Server Image  Services. ArcGIS Desktop clients have full access to imagery through the Mosaic Dataset. The ArcGIS Server Image Service can access single image catalogs directly. ArcGIS Server Image Extension enables image service access to Imagery through a published Mosaic Dataset.

ArcGIS Server map caching makes it feasible to create and maintain an  optimized map cache of the Imagery layers. The imagery map cache is an optimized tiled layer including a pyramid of standard map scales. Imagery map cache can be accessed as seamless preprocessed map tiles, brought together as an Imagery basemap. GIS client application overlays dynamic business layers (spatial data) over the Imagery basemap in the  map display. Preprocessed map cache image pyramids provide the fastest access and the most scalable GIS data source.
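The pyramid structure of a map cache follows a simple rule: each successive scale level quadruples the tile count. A quick sketch (an idealized full-extent pyramid; real caches only generate tiles covering the data):

```python
def pyramid_tile_count(levels):
    """Total tiles in an idealized cache pyramid: one tile at the top,
    then 4, 16, 64, ... at each finer scale level."""
    return sum(4 ** n for n in range(levels))

# A 4-level pyramid holds 1 + 4 + 16 + 64 tiles:
print(pyramid_tile_count(4))  # 85
```

The geometric growth explains why cache generation cost and storage are dominated by the finest one or two levels, and why partial and on-demand caching of those levels is so valuable.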

ArcGIS Server provides full, partial, and on-demand caching. ArcGIS 10 includes a compact map cache format that reduces storage volume by up to 90 percent and improves disk access. ArcGIS 10 supports mixed-mode cache tile formats, which allow most of the imagery to be stored in highly compressed JPEG format with transparent PNG24 boundary tiles. Mixed-mode cache tile formats are important for incremental updates to the Imagery map cache repository.

Historical Imagery Online Management
Historical records of GIS features are becoming popular as temporal GIS workflow analysis grows more important as a business information resource. Imagery metadata formats include time-stamped images. Storage vendors provide highly scalable content storage solutions that can manage access and protection of file data sources over long periods of time. These storage solutions are combined with hierarchical storage management, moving inactive data files to lower cost media based on usage requirements. Hierarchical storage management in conjunction with content storage technology is an evolving solution for online temporal access to unlimited-capacity Imagery archives.

ArcGIS Imagery Access Patterns
Once Imagery is available on a local network file share, ArcGIS Desktop can  be used to author a Mosaic Dataset for multiple image data sources or  establish an Image Catalog for a single Image data source. Imagery can be published and accessed by ArcGIS Desktop clients or through ArcGIS  Server Image Services.

What is a Mosaic Dataset?
Figure 4-21 provides an overview of the Mosaic Dataset. The Mosaic Dataset is a catalog of Imagery and rasters, associated metadata, and processing  functions for managing access to online raster data sources. The Mosaic Dataset is stored in a geodatabase and authored using ArcGIS Desktop. The processing functions enable dynamic mosaicking and on-the-fly imagery processing.
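The catalog-plus-functions idea can be sketched as a toy model. This is a drastic simplification (one-dimensional footprints, scalar pixels, hypothetical function chain), not the actual Mosaic Dataset implementation:

```python
class MosaicDataset:
    """Toy mosaic dataset: a catalog of raster references plus processing
    functions applied on the fly at request time -- the source files are
    never rewritten."""

    def __init__(self):
        self.items = []      # (footprint, pixels) catalog records
        self.functions = []  # on-the-fly processing chain

    def add_raster(self, footprint, pixels):
        """Catalog a source raster; footprint is a (min, max) extent."""
        self.items.append((footprint, list(pixels)))

    def add_function(self, fn):
        """Append a processing function (e.g. a stretch) to the chain."""
        self.functions.append(fn)

    def request(self, point):
        """Mosaic on the fly: find the raster covering the point, then
        run its pixel through the function chain."""
        for (xmin, xmax), pixels in self.items:
            if xmin <= point < xmax:
                value = pixels[point - xmin]
                for fn in self.functions:
                    value = fn(value)
                return value
        return None

md = MosaicDataset()
md.add_raster((0, 3), [10, 20, 30])
md.add_raster((3, 6), [40, 50, 60])
md.add_function(lambda v: v * 2)  # a hypothetical on-the-fly stretch
print(md.request(4))  # 100
```

The key property the sketch preserves is that processing happens at request time against untouched source data, which is exactly why storing imagery "as is" loses no quality.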

Direct Image Access
Figure 4-22 shows ArcGIS Desktop read/write access to individual Image files using a traditional workstation. Direct access is available for a variety of Image formats, including TIFF, IMG, and MrSID. ArcGIS Desktop can also create a Mosaic Dataset of local Image resources for access and management of the available Imagery inventory.

ArcGIS Server Image Service
Figure 4-23 shows options available for accessing Imagery through ArcGIS  Server Image Service. The ArcGIS Server Image Service provides direct access to preprocessed single Image Datasets. The ArcGIS Server Image Extension expands Image Services to include on-the-fly processing of  multiple Imagery resources utilizing a published Mosaic Dataset. 

Cached Static Imagery Services
Figure 4-24 shows access to a cached image service. ArcGIS Desktop or ArcGIS Server can be used to create and maintain an Imagery Cache. Imagery cache is read only – what you cache is what you get. Imagery is preprocessed and tiled for high performance. This is the same format used by Google and Bing maps – with world-wide coverage of Bing Maps  available online through ArcGIS.com.

Recommended Imagery Workflow
Managing your imagery resources is becoming increasingly important for most organizations. People expect to see Imagery as a background option in their map display – it is how we relate to our world. Figure 4-25 shows what we recommend for managing your Imagery inventory.

Imagery is provided by Satellite or aerial photography suppliers. Once Imagery is loaded on your local network, use ArcGIS Desktop to author a Mosaic  Dataset of your imagery. The Mosaic Dataset provides local access to your imagery. ArcGIS Desktop clients can access the Mosaic Dataset and leverage on-the-fly processing of multiple image data sources. ArcGIS Server can provide image services from a preprocessed Imagery dataset.

ArcGIS Server Image Extension can leverage the mosaic dataset for dynamic  on-the-fly processing of multiple imagery data sources, providing a full  range of image services to local desktop and distributed Web clients. ArcGIS Server can also be used to create a map cache basemap for serving the larger Web community.

When receiving imagery updates, use ArcGIS Desktop to register updates with the Mosaic Dataset  and ArcGIS Server to update the Image map cache basemap.



Enterprise GIS Data Management
Enterprise GIS Data often includes a mixture of vector business layers, more  stable reference land base layers, and imagery. Figure 4-26 provides a composite overview of enterprise GIS data management, including  management options for map features (business and land base layers) and  raster data (dynamic and static). Historical feature and imagery data is also becoming increasingly important in supporting business analysis  needs. ArcGIS provides a full range of tools for managing your data for optimum use by your organization. 

Ways to Store Spatial Data
Storage technology has evolved over the past 20 years to improve data access  and provide better management of available storage resources. Understanding the advantages of each technical solution will help you select the storage architecture that best supports your needs.

Evolution of Storage Area Networks
Figure 4-27 provides an overview of the evolution of traditional storage from internal workstation disk to the storage area network architecture.



Internal Disk Storage. The most elementary storage architecture puts the  storage disk on the local machine. Most computer hardware today includes internal disk for use as the storage medium. Workstations and servers can both be configured with internal disk storage. The fact that access to it is through the local workstation or server can be a significant  limitation in a shared server environment: if the server operating  system goes down, there is no way for other systems to access the  internal data resources.

File server storage provides a network share that can be accessed by many client applications within the local network. Disk mounting protocols (NFS and CIFS) provide local application access over the network to the data on the file server platform. Query processing is performed by the application client, which can involve a high volume of chatty communication over the client/server network connection.

Database server storage provides query processing on the server platform, which significantly reduces the chatty network communications. Database software also improves data management by providing better administration and control of the data.
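The difference between client-side and server-side query processing can be sketched with a small example. This is an illustrative stand-in only: it uses an in-memory SQLite table (not ArcSDE or a production DBMS), and the table name and row counts are invented for the demonstration. The point is the traffic difference, not the specific software.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (id INTEGER, area REAL)")
conn.executemany("INSERT INTO parcels VALUES (?, ?)",
                 [(i, float(i % 100)) for i in range(10_000)])

# File-server pattern: the client pulls every row over the wire, then filters.
all_rows = conn.execute("SELECT id, area FROM parcels").fetchall()
client_side = [row for row in all_rows if row[1] > 95]

# Database-server pattern: the query runs on the server; only matches travel.
server_side = conn.execute(
    "SELECT id, area FROM parcels WHERE area > 95").fetchall()

assert client_side == server_side   # same answer...
assert len(all_rows) == 10_000      # ...but 10,000 rows crossed the network
assert len(server_side) == 400      # versus 400 the other way
```

Both patterns return identical results; the file-share pattern simply moves far more data across the network to get them, which is the chatty behavior described above.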

Internal storage can include RAID mirror disk volumes that preserve the data store in the event of a single disk failure. Many servers include storage trays with multiple disk drives, supporting RAID 5 configurations and high-capacity storage needs. Internal storage access is limited to the host server, so as data center environments grew larger in the 1990s, customers would find some servers with too much disk (disk not being used) and others with too little, making disk volume management a challenge (data volumes could not be shared between server internal storage volumes). External storage architectures (Direct Attached Storage, Storage Area Networks, and Network Attached Storage) provide a way for organizations to break out of these silo-based storage solutions and build a more manageable and adaptive storage architecture.

Direct Attached Storage. A direct attached storage (DAS) architecture provides the storage disk on an external storage array platform. Host bus adapters (HBA) connect the server operating system to the external storage controller using the same block-level protocols used for internal disk storage, so from an application perspective direct attached storage appears and functions the same as internal storage. The external storage arrays can be designed with fully redundant components (the system continues operating through any single component failure), so a single storage array product can satisfy high-availability storage requirements.

Direct attached storage solutions often provide several Fiber Channel connections between the storage controller and the server HBAs. For high availability, it is standard practice to configure two HBA Fiber Channel connections for each server environment. Typical direct attached storage solutions provide from 4 to 8 Fiber Channel connections, so you can easily connect up to 4 servers, each with redundant Fiber Channel connections, to a single direct connect storage array controller. Multiple disk storage volumes are configured and assigned to each specific host server, and the host servers have full access control over the assigned storage volumes. In a server failover scenario, the primary server disk volumes can be reassigned to the failover server.

Storage Area Networks. The difference between direct attached storage and a storage area network is the introduction of a Fiber Channel switch to establish network connectivity between multiple servers and multiple external storage arrays. The storage area network (SAN) improves administrative flexibility for assigning and managing storage resources when you have a growing number of server environments. The server HBAs and the external storage array controllers are connected to the Fiber Channel switch, so any server can be assigned storage volumes from any storage array in the storage farm (connected through the same storage network). Storage protocols are the same as with direct attached or internal storage, so from a software perspective these storage architecture solutions appear the same and are transparent to the application and data interface.

Evolution of Network Attached Storage
Network Attached Storage. By the late 1990s, many data centers were using servers to provide client application access to shared file data sources. Highly available environments required complicated failover clustered file servers, so that if one of the file servers failed, users would still have access to the file share. Hardware vendors decided to provide a hybrid appliance configuration to handle these network file shares (called Network Attached Storage, or NAS). The network attached storage appliance incorporates a file server and storage in a single consolidated, highly available storage platform. The file server can be configured with a modified operating system that provides both NFS and CIFS disk mount protocols, and a storage array with this modified file server network interface is deployed as a simple network attached storage appliance. The storage appliance includes a standard network interface card (NIC) connection to the local area network, and client applications can connect to the storage appliance file shares over standard disk mount protocols. Network attached storage provided a very simple way to deploy a high-capacity network file share for access by a large number of UNIX and/or Windows network clients. Figure 4-28 shows the evolution of the network attached storage architecture.



Network attached storage provides a very effective architecture alternative for supporting network file shares and has become very popular among many GIS customers. When GIS data migrated from early file-based data stores (coverages, LIBRARIAN, ArcStorm, shapefiles) to a more database-centric data management environment (ArcSDE geodatabase servers), the network attached storage vendors suggested customers could use a network file share to support database server storage. There were some limitations. It is important to assign dedicated data storage volumes controlled by each host database server to avoid data corruption. Other limitations included slower database query performance due to chatty IP disk mount protocols, and IP network bandwidth that was lower than in the Fiber Channel switch environments (1 Gbps IP networks vs 2 Gbps Fiber Channel networks). Implementing Network Attached Storage as an alternative to Storage Area Networks was not an optimum storage architecture for geodatabase server environments. Network attached storage was an optimum architecture for file-based data sources, however, and use of the NAS technology alternative continued to grow.

Because of the simple nature of network attached storage solutions, you can use a standard local area network (LAN) switch to provide the network that connects your servers and storage; this is a big selling point for the NAS proponents. There is considerable competition between Storage Area Network and Network Attached Storage technology, particularly when supporting the more common database environments. The SAN community will claim their architecture has higher-bandwidth connections and uses standard storage block protocols. The NAS community will claim they can support your storage network using standard LAN communication protocols and can serve both database servers and network file access clients from the same storage solution.

The network attached storage community eventually introduced a more efficient iSCSI communication protocol for IP storage networks (SCSI storage protocols over IP networks). GIS architectures today include a growing number of file data sources (examples include the ArcGIS Server Image Extension, ArcGIS Server pre-processed 2D and 3D map caches, and the file geodatabase). For many GIS operations, a blend of these storage technologies (SAN and NAS) provides the optimum storage solution. Introduction of iSCSI over an IP switched storage network architecture provides an attractive compromise for mixed DBMS/file share storage requirements.

Ways to Protect Spatial Data
Enterprise GIS environments depend heavily on GIS data to support a variety of  critical business processes. Data is one of the most valuable resources of a GIS, and protecting data is fundamental to supporting critical  business operations.

The primary data protection line of defense is provided by the storage solutions. Most storage vendors have standardized on redundant array of independent disks (RAID) storage  solutions for data protection. A brief overview of basic storage protection alternatives includes the following:

Just a Bunch of Disks (JBOD): A disk volume with no RAID protection is referred to as a just a bunch of disks (JBOD) configuration. This represents a configuration of disks with no protection and no performance optimization.

RAID 0: A disk volume in a RAID 0 configuration provides striping of data across several disks  in the storage array. Striping supports parallel disk controller access to data across several disks reducing the time required to locate and  transfer the requested data. Data is transferred to array cache once it is found on each disk. RAID 0 striping provides optimum data access performance with no data protection. One hundred percent of the disk volume is available for data storage.

RAID 1: A disk volume in a RAID 1 configuration provides mirror copies of the data  on disk pairs within the array. If one disk in a pair fails, data can be accessed from the remaining disk copy. The failed disk can be replaced and data restored automatically from the mirror copy without  bringing the storage array down for maintenance. RAID 1 provides optimum data protection with minimum performance gain. Available data storage is limited to 50 percent of the total disk volume, since a mirror disk  copy is maintained for every data disk in the array.

RAID 3 and 4: A disk volume in a RAID 3 or RAID 4 configuration supports  striping of data across all disks in the array except for one parity  disk. A parity bit is calculated for each data stripe and stored on the parity disk. If one of the disks fails, the parity bit can be used to recalculate and restore the missing data. RAID 3 provides good protection of the data and allows optimum use of the storage volume. All but one parity disk can be used for data storage, optimizing use of the  available disk volume for data storage capacity.

There are some technical differences between RAID 3 and RAID 4, which, for  our purposes, are beyond the scope of this discussion. Both of these storage configurations have potential performance disadvantages. The common parity disk must be accessed for each write, which can result in  disk contention under heavy peak user loads. Performance may also suffer because of requirements to calculate and store the parity bit for each  write. Write performance issues are normally resolved through array cache algorithms on most high-performance disk storage solutions.
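The parity mechanism described above is, at its core, a byte-wise XOR across the data stripes. The following sketch is illustrative only: real arrays implement this in controller hardware, and the block contents here are invented. It shows how a lost block can be rebuilt from the surviving blocks plus the parity block.

```python
from functools import reduce

def parity_block(data_blocks):
    """Compute the parity block as the byte-wise XOR of all data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, stripe)
                 for stripe in zip(*data_blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild a lost data block by XOR-ing the parity with the survivors."""
    return parity_block(surviving_blocks + [parity])

# Hypothetical 3-disk stripe: two data blocks plus one parity block.
d0 = b"\x0f\xf0\xaa"
d1 = b"\x01\x02\x03"
p = parity_block([d0, d1])

# If the disk holding d1 fails, XOR of d0 and the parity recovers it.
assert reconstruct([d0], p) == d1
```

The same XOR property is why each write must update the parity block: the parity is only useful for recovery if it reflects the current contents of every data stripe.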

The following RAID configurations are the most commonly used to support ArcSDE storage solutions. These solutions represent RAID combinations that best support data protection and performance goals. Figure 4-29 provides an overview of the most popular composite RAID configurations.



RAID 1/0: RAID 1/0 is a composite solution including RAID 0 striping and  RAID 1 mirroring. This is the optimum solution for high performance and data protection. This is also the costliest solution. Available data storage is limited to 50 percent of the total disk volume, since a  mirror disk copy is maintained for every data disk in the array.

RAID 5: RAID 5 includes the striping and parity of the RAID 3 solution  and the distribution of the parity volumes for each stripe across the  array to avoid parity disk contention performance bottlenecks. This improved parity solution provides optimum disk utilization and near  optimum performance, supporting disk storage on all but one parity disk  volume.

Hybrid Solutions: Some vendors provide alternative proprietary RAID strategies to enhance their storage  solution. New ways to store data on disk can improve performance and protection and may simplify other data management needs. Each hybrid solution should be evaluated to determine if and how it may support  specific data storage needs.
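The capacity trade-offs among these RAID levels reduce to simple arithmetic. A small sketch using the usable-capacity fractions stated above; treat these as idealized figures, since real arrays also reserve space for hot spares and metadata:

```python
def usable_fraction(raid_level, disks):
    """Fraction of raw disk capacity available for data, per the
    descriptions above (idealized; vendor implementations vary)."""
    if raid_level == "0":                 # striping only, no redundancy
        return 1.0
    if raid_level in ("1", "1/0"):        # every data disk has a mirror
        return 0.5
    if raid_level in ("3", "4", "5"):     # one disk's worth of parity
        return (disks - 1) / disks
    raise ValueError(f"unknown RAID level: {raid_level}")

# A hypothetical 8-disk array of 1 TB drives under each configuration:
for level in ("0", "1/0", "5"):
    print(f"RAID {level}: {usable_fraction(level, 8) * 8:.0f} TB usable")
```

For the 8-disk example this gives 8 TB for RAID 0, 4 TB for RAID 1/0, and 7 TB for RAID 5, which is why RAID 5 is often described as near-optimum disk utilization.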

ArcSDE data storage strategies depend on the selected database environment.

SQL Server: Log files are located on RAID 1 mirror, and index and data tables are located on RAID 5 disk volume.

Oracle, Informix, and DB2: Index tables and log files are located on RAID  1/0 mirror, and striped data volumes and data tables are located on RAID  5.

Potential Storage Contention Concerns
Technology is improving display performance and moving more data faster and more  efficiently throughout the server, network, and storage infrastructure. For GIS, computer processing speed has traditionally limited how fast we use our data resources. As processing speed and data traffic increases, GIS data access can place new demands on storage subsystems. Figure 4-30 highlights some potential storage performance concerns and identifies some available solutions to address these concerns.

Most storage solutions today use mechanical arms to access data stored on a  rotating disk platter. Disk rotation speed has stayed about the same for a long time (over 10 years). Storage capacity has increased within the same disk volume as media density improves while the average disk  input/output (I/O) access speed for these mechanical drives has not seen  much improvement.

Disk contention is more likely today due to faster display processing times, higher capacity servers, and  larger peak concurrent GIS user loads. The larger disk volumes mean that data is not as spread out across as many disks as was the case with  smaller drives. Other factors, such as disk fragmentation caused by rapid data update cycles and larger volumes of storage, reduce  efficiency in accessing data and increase the potential for data  contention (I/O backups).

GIS map cache data sources remove the need for server processing, further increasing server  capacity and the volume of requests for cached data resources. We have preprocessed a significant portion of the traditional GIS data  resources; access to preprocessed data requires very little processing.

There are some characteristics about data caching that help avoid disk  contention. When a cached file is requested by a client application, the file content will be cached on its way to the client display. If the same file is requested again, it can be delivered from cache and  avoid additional requests to the disk level. Inline caching reduces disk contention risk when there are large demands for exactly the same  file – the file can be delivered from distributed cached sources without  returning to the original disk. This makes it difficult to identify just when disk contention might become a problem – it all depends on the  types of user workflows accessing the data on the same disk and how the  data might be cached in the system.
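The inline caching behavior described above can be sketched as a small least-recently-used (LRU) cache. This is an illustrative model only; the class name, tile paths, and loader function are invented, and real map cache servers implement this in their own caching tiers:

```python
from collections import OrderedDict

class TileCache:
    """Minimal LRU cache sketch: repeated requests for the same map-cache
    file are served from memory instead of returning to disk."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk  # fallback loader on a miss
        self.store = OrderedDict()
        self.disk_reads = 0

    def get(self, path):
        if path in self.store:
            self.store.move_to_end(path)      # mark as recently used
            return self.store[path]
        data = self.read_from_disk(path)      # only cache misses touch disk
        self.disk_reads += 1
        self.store[path] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used
        return data

# Hypothetical loader standing in for a real file read.
cache = TileCache(capacity=2, read_from_disk=lambda p: f"tile:{p}")
cache.get("L0/R0/C0"); cache.get("L0/R0/C0"); cache.get("L0/R0/C1")
# Three requests but only two distinct tiles, so only two disk reads.
assert cache.disk_reads == 2
```

The same dynamic explains why contention is hard to predict: a workload where many users request the same few tiles rarely touches disk, while a workload of unique requests defeats the cache entirely.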

The good news is that there are technical solutions available when we do experience disk contention, and it is quite simple to monitor disk I/O performance and identify whether disk contention is a problem. A variety of vendor solutions are available that accelerate Web access by caching data files on edge servers or Web accelerator appliances (servers located near the site communication access points). Disk access can also be improved with RAID data striping: distributing the data across larger RAID volumes can reduce the probability of disk contention. Solid state disk drives are available in the current marketplace, with solutions that deliver data over 1000 times faster than current mechanical disk drives. The cost of solid state drives is currently much higher than their mechanical counterparts; this can change as the solid state market sales volume increases (vendors are waiting for an opportunity to upgrade our storage solutions).

So what should we do to minimize disk contention risk? Be aware of the potential for performance bottlenecks within your storage environment. Monitor disk I/O performance to identify when disk contention is causing a performance delay. Be aware that there are solutions to disk I/O performance problems, and take  appropriate action to address performance issues when they occur.
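Monitoring disk I/O for contention usually means sampling a cumulative busy-time counter twice and comparing the delta to wall-clock time (on Linux this counter appears in /proc/diskstats; libraries such as psutil expose similar per-disk fields). A minimal sketch, with the sampling callable injected so the example stays self-contained and platform-neutral:

```python
import time

def io_busy_percent(sample_io_time_ms, interval=1.0, sleep=time.sleep):
    """Estimate how busy a disk is over an interval.

    `sample_io_time_ms` is any callable returning a cumulative
    'milliseconds spent doing I/O' counter for the disk of interest.
    A result approaching 100% is a strong sign of disk contention.
    """
    start = sample_io_time_ms()
    sleep(interval)                      # wait out the sampling window
    delta_ms = sample_io_time_ms() - start
    return 100.0 * delta_ms / (interval * 1000.0)

# Hypothetical counter: pretend the disk was busy 250 ms of a 1 s window.
fake_counter = iter([0, 250])
busy = io_busy_percent(lambda: next(fake_counter),
                       interval=1.0, sleep=lambda s: None)
assert abs(busy - 25.0) < 1e-9
```

In practice you would sample during peak user load and watch for sustained readings near saturation, which is the condition the paragraph above recommends acting on.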

Ways to Back Up Spatial Data
Data protection at the disk level minimizes the need for system recovery in  the event of a single disk failure but will not protect against a  variety of other data failure scenarios. It is always important to keep a current backup copy of critical data resources, and maintain a recent  copy at a safe location away from the primary site. Figure 4-31 highlights data backup strategies available to protect your business  operations.



The type of backup system you choose will depend on your business needs. For simple, low-priority, single-use environments, you can create a periodic point-in-time backup on a local disk or tape drive and maintain a recent off-site copy of your data for business recovery. For larger enterprise operations, system availability requirements may drive requirements for failover to a backup data center when the primary site fails. Your business needs will drive the level of protection you need.

Data backups provide the last line of defense for protecting our data  investments. Careful planning and attention to storage backup procedures are important factors to a successful backup strategy. Data loss can result from many types of situations, with some of the most probable  situations being administrative or user error.

Host Tape Backup: Traditional server backup solutions use lower-cost tape storage for backup. Data must be converted to a tape storage format and stored on a linear tape medium. Backups can be a long, drawn-out process, taking considerable server processing resources (they typically consume a CPU during the backup process) and requiring special data management for operational environments.

For database environments, point-in-time backups are required to maintain database continuity. Database software provides for online backup requirements by enabling a procedural snapshot of the database. A copy of the protected snapshot data is retained in a snapshot table when changes are made to the database, supporting point-in-time backup of the database and potential database recovery back to the time of the snapshot.
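The snapshot mechanism described above is a copy-on-write scheme: the original value of a row is preserved once, just before its first post-snapshot change. A simplified sketch (the class and row names are invented, and real DBMS snapshots operate at the page or block level rather than per row):

```python
class SnapshotStore:
    """Copy-on-write sketch of a point-in-time database snapshot."""

    def __init__(self, rows):
        self.rows = dict(rows)   # live database rows
        self.snapshot = None     # original values saved after snapshot

    def take_snapshot(self):
        self.snapshot = {}       # start tracking pre-change values

    def write(self, key, value):
        # Preserve the original value once, before the first change.
        if self.snapshot is not None and key not in self.snapshot:
            self.snapshot[key] = self.rows.get(key)
        self.rows[key] = value

    def as_of_snapshot(self):
        """Reassemble the database as it looked at snapshot time."""
        view = dict(self.rows)
        view.update(self.snapshot)
        return view

db = SnapshotStore({"parcel_1": "zoned A"})
db.take_snapshot()
db.write("parcel_1", "zoned B")           # change made after the snapshot
assert db.rows["parcel_1"] == "zoned B"   # live data moves on
assert db.as_of_snapshot()["parcel_1"] == "zoned A"  # backup sees old state
```

This is why online backups can run against a consistent point-in-time view while users continue editing: the backup reads the snapshot view, and only changed values incur the extra copy.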

Host processors can be used to support backup operations during off-peak  hours. If backups are required during peak-use periods, backups can impact server performance.

Network Client Tape Backup: The traditional online backup can often be supported over the  LAN with the primary batch backup process running on a separate client  platform. DBMS snapshots may still be used to support point-in-time backups for online database environments. Client backup processes can contribute to potential network performance bottlenecks between the  server and the client machine because of the high data transfer rates  during the backup process.

Storage Area Network Client Tape Backup: Some backup solutions support direct disk storage  access without impacting the host DBMS server environment. Storage backup is performed over the SAN or through a separate storage network  access to the disk array with batch process running on a separate client  platform. A disk-level storage array snapshot is used to support point-in-time backups for online database environments. Host platform processing loads and LAN performance bottlenecks can be avoided with  disk-level backup solutions.

Disk Copy Backup: The size of databases has increased dramatically in recent years, growing from tens of gigabytes to hundreds of gigabytes and, in many cases, terabytes of data. Recovery of large databases from tape backups is very slow, taking days for large spatial database environments. At the same time, the cost of disk storage has decreased dramatically, making disk copy solutions for large database environments competitive in price with tape storage solutions. A copy of the database on local disk, or a copy of these disks at a remote recovery site, can support immediate restart of the DBMS following a storage failure by simply restarting the DBMS with the backup disk copy.

There are several names for disk backup strategies (remote backup, disaster recovery plan, business continuance plan, continuation of business operations plan, etc.). The important thing is that you consider your business needs, evaluate the risks associated with loss of data resources, and establish a formal plan for business recovery in the event of data loss.

Data Management Overview
Support for distributed database solutions has traditionally introduced  high-risk operations, with potential for data corruption and use of  stale data sources in GIS operations. There are organizations that support successful distributed solutions. Their success is based on careful planning and detailed attention to their administrative  processes that support the distributed data sites. More successful GIS implementations support central consolidated database environments with  effective remote user performance and support. Future distributed database management solutions may significantly reduce the risk of  supporting distributed environments. Whether centralized or distributed, the success of enterprise GIS solutions will depend heavily on the  administrative team that keeps the system operational and provides an  architecture solution that supports user access needs.

The next chapter on performance fundamentals will focus on understanding the technology, presenting the fundamental terms and relationships used in system architecture design and capacity planning.

Previous Editions
[Spring 2010 GIS Data Administration 27th Edition]

System Design Strategies 26th edition - An Esri® Technical Reference Document • 2009 (final PDF release)