Performance Management

Spring 2019 Performance Management 44th Edition

Esri has implemented distributed GIS solutions since the late 1980s. For many years, distributed processing environments were not well understood, and customers relied on the experience of technical experts to identify hardware requirements to support their implementation needs. Each technical expert had a different perspective on what hardware infrastructure might be required for a successful implementation, and recommendations were not consistent. Many hardware decisions were made based on the size of the project budget, rather than a clear understanding of user requirements and the appropriate hardware technology. Many GIS implementation projects would fail due to poor system design and lack of performance management.

Esri started developing simple system performance models in the early 1990s to document our understanding about distributed processing systems. These system performance models have been used by Esri system design consultants to support distributed computing hardware solutions since 1992. These same performance models have also been used to identify potential performance problems with existing computing environments.

The Capacity Planning Tool (CPT) was introduced in 2008. It incorporates the best of the traditional client/server and web services sizing models and provides an adaptive sizing methodology to support future enterprise GIS operations. The new capacity planning methodology is much easier to use and provides metrics to manage performance compliance during development, initial implementation, and system delivery.

This chapter introduces how these design models can be used for performance management. 

System performance factors
Figure 10.1 identifies some key components that contribute to overall system performance. Software technology selection and application design drive the processing loads and network traffic requirements. Hardware and architecture selection establish processing capabilities and how the processing loads are distributed. Network connectivity establishes infrastructure capacity for handling the required traffic loads.

"Warning: Weakest system component determines overall system performance (performance chain). "

"Best practice: Balanced system design provides optimum user performance at lowest system cost."

Software technology factors
Software design efficiency and level of analysis establish the complexity of the application functions. Data source structure and the size and composition of the data contribute to the complexity of the information the application must work with.

Application:
 * Core software and client application efficiency.
 * Display complexity includes layers per display, features per display extent, functions used to complete the display, and display design for each map scale.
 * Display traffic
 * User workflow activity including user productivity, implementation of heavy workflow tasks, and workflow efficiency (mouse clicks to final display, communication chatter)

Data source:
 * Data source technology including DBMS (data types, indexing, tuning, scalability), file source (file format, structure, indexing, scalability), imagery (image format, file size, indexing, pre-processing, on-the-fly processing), or cached data source.
 * Geodatabase design including table structure, dependencies, and relationship classes.
 * Data connection including SDE (direct connect, application server connect) or file source (internal disk, direct attached, network attached).

Hardware technology factors
Hardware design and performance characteristics determine how fast the servers can do work and the volume of work they can handle at one time.


 * Workstation/application server/GIS server including processor core performance, platform capacity (servers), physical memory, network connection, graphics processing unit.
 * Data server including processor core performance, platform capacity, physical memory, and network connection.
 * Network communications including bandwidth, traffic, latency, and application communication chatter.

The system design solution must provide sufficient platform and network capacity to process software loads within peak user performance needs.

"Best practice: CPT Standard Workflows provide proper processing load profile. " 

How is performance managed?
System architecture design provides a framework for identifying a balanced system design and establishing reasonable software processing performance budgets. Performance expectations are established based on selected software processing complexity and vendor published hardware processing capacity. System design performance expectations can be represented by established software processing performance targets. These performance targets can be translated into specific software performance milestones which can be validated during system deployment. Software processing complexity and/or hardware processing capacity can be reviewed and adjusted as necessary at each deployment milestone to ensure the system is delivered within the established performance budget.

Our understanding of GIS processing complexity and how this workload is supported by vendor platform technology is based on more than 20 years of experience. A balanced software and hardware investment, with capacity based on projected peak user workflow loads, can reduce cost and ensure system deployment success.

Most project managers clearly understand the importance and value of a project schedule in managing deployment risk associated with cost and schedule. The same basic project management principles can be applied to managing system performance risk. Figure 10.2 shows some basic concepts that can be used in managing performance.

System architecture design framework:
 * CPT provides balanced standard and custom workflow load profiles.
 * Workflow complexity assessment is used to assign reasonable software processing performance budgets.

Workflow complexity assessment:
 * Light complexity represents simple user displays with minimum functional analysis (light processing loads).
 * Medium complexity represents standard workflow performance targets that satisfy most workflows that apply best practice design standards. Medium complexity processing loads are roughly twice those of light complexity.
 * Heavy complexity represents workflows that include more complex map displays or data models that generate 50 percent more processing than medium complexity workflows.
 * Additional complexity selections (2x medium, 3x medium, 4x medium, …10x medium) are available for establishing much heavier performance targets.
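The complexity levels listed above can be read as multipliers on a medium complexity workflow. The following Python sketch is only an illustration of that reading: the 0.5 multiplier for light is an approximation of "roughly twice," and the function names and the sample service time are hypothetical, not CPT values.

```python
# Multipliers relative to a medium complexity workflow, taken from the list above.
# "light" uses 0.5 because medium is described as roughly twice light.
COMPLEXITY_MULTIPLIER = {"light": 0.5, "medium": 1.0, "heavy": 1.5}


def complexity_multiplier(level: str) -> float:
    """Return the processing load multiplier for a complexity label.

    Accepts "light", "medium", "heavy", or "Nx medium" for N from 2 to 10.
    """
    level = level.strip().lower()
    if level in COMPLEXITY_MULTIPLIER:
        return COMPLEXITY_MULTIPLIER[level]
    if level.endswith("x medium"):
        factor = int(level[: -len("x medium")])
        if 2 <= factor <= 10:
            return float(factor)
    raise ValueError(f"unknown complexity level: {level!r}")


def service_time_budget(medium_service_time_sec: float, level: str) -> float:
    """Scale a medium complexity service time to another complexity level."""
    return medium_service_time_sec * complexity_multiplier(level)


# Hypothetical medium complexity service time of 0.4 sec (not a CPT value).
for label in ("light", "medium", "heavy", "3x medium"):
    print(label, service_time_budget(0.4, label))
```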

Workflow complexity guidelines:
 * Light complexity represents the minimum load expected based on software technology selection.
 * Medium complexity would support up to 80 percent of selected software technology deployments.
 * Heavy complexity represents user workflows with more complex data models (more layers, more features per layer, and more complex analysis).
 * 2x medium, 3x medium, 4x medium, …10x medium represent much more complex workflow loads that are possible with expanding technology and emerging display details.

Faster hardware processing allows more complex analysis to be included in the user workflows. These heavier complexity workflows (2x, 3x, 4x, …10x medium) may not handle a large number of concurrent users, but with today's technology they can deliver map display results in a reasonable response time (less than 5 sec).

"Best practice: Performance expectations are established based on selected software processing complexity and vendor published hardware processing speed (per core performance). "

Platform throughput and service time
The most important system performance terms define the average work transaction (display), work throughput, system capacity, and system utilization. Figure 10.3 provides a chart showing the relationship between utilization and throughput; a simple relationship that can be used to identify platform capacity.

Capacity (DPM) = Throughput (DPM)/Utilization

"Best practice: If you know the current throughput (users working on the system) and you measure the system utilization (average computer CPU utilization), then you can know the capacity of the server."

The relationship between throughput, capacity, and utilization holds by definition of these terms.
 * Throughput is the number of work transactions being processed per unit time.
 * Capacity is the maximum throughput that can be supported by a specific hardware configuration.
 * Utilization is the ratio of the current throughput to the system capacity (expressed as percentage of capacity).
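The capacity relationship above can be sketched in a few lines of Python. The function name and the sample measurement are hypothetical; utilization is expressed as a fraction (0.35 for 35 percent).

```python
def capacity_dpm(throughput_dpm: float, utilization: float) -> float:
    """Capacity (DPM) = Throughput (DPM) / Utilization.

    utilization is a fraction of full capacity, greater than 0 and at most 1.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be greater than 0 and at most 1")
    return throughput_dpm / utilization


# Hypothetical measurement: 250 displays per minute at 35 percent CPU utilization.
print(round(capacity_dpm(250, 0.35)))  # 714
```

The same function applies at any measured load, as long as throughput and utilization are collected for the same period.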

The processor core is the hardware that executes the computer program instructions.
 * Number of processor cores identifies how many instances can be serviced at the same time.
 * Service time is a measure of the average work transaction processing time.

Work transaction service time is a key term used to measure software performance.
 * The software program provides a set of instructions that must be executed by the computer to complete a work transaction.
 * The processor core executes the instructions defined in the computer program to complete the work transaction.

Transactions with more instructions represent more work for the computer, while transactions with fewer instructions represent less work for the computer.

The complexity of the computer program workflow can be defined by the amount of work (or processing time) required to complete an average work transaction.
 * Service time on the CPT Workflow tab is presented relative to a platform performance baseline.
 * Faster platform processor cores execute program instructions in less time than slower processor cores.
 * Service time can be computed using a simple formula based on number of processor cores and platform capacity.

Service time (sec) = 60 sec x #core/Capacity (DPM)

Service time can be computed based on measured throughput and utilization. 


 * Service time calculations

Figure 10.4 shows service time results for five different throughput loads.
 * Number of deployed service instances determines peak loads.
 * Throughput and utilization are measured for each of the five separate test configurations.
 * Capacity of 714 DPM was calculated from each test load.
 * Service time of 0.34 sec was calculated from each test load.

"Best practice: You can calculate capacity from throughput and utilization measurements at any system load."

"Note: Real operational environments can provide a very good measure of capacity."

Once you know the platform capacity, you can compute the platform service time. 
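As a minimal sketch, the service time formula can be applied to the Figure 10.4 capacity value. The core count is not stated in the text; 4 cores is an assumption, chosen because it is the value consistent with a 714 DPM capacity and a 0.34 sec service time. Function names are hypothetical.

```python
def service_time_sec(cores: int, capacity_dpm: float) -> float:
    """Service time (sec) = 60 sec x #core / Capacity (DPM)."""
    return 60.0 * cores / capacity_dpm


def service_time_from_measurement(
    cores: int, throughput_dpm: float, utilization: float
) -> float:
    """Derive capacity from a throughput/utilization measurement, then service time."""
    capacity = throughput_dpm / utilization  # Capacity = Throughput / Utilization
    return service_time_sec(cores, capacity)


# Assumption: the test platform has 4 cores (not stated in the source).
print(round(service_time_sec(4, 714), 2))  # 0.34
```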

Platform performance and response time
Figure 10.5 provides a chart showing the relationship between utilization and response time.

You can calculate display service time if you know the platform throughput and corresponding utilization, calculated at any throughput level. Calculating user display response time for shared system loads is a little bit more difficult.

Calculating user response time:
 * Only one user transaction can be serviced at a time on each processor core.
 * If many user transaction requests arrive at the same time, some of the transactions must wait in line while the others are processed first.
 * Waiting in line for processing contributes to system processing delays.
 * User display response time must include time for all the system component processing times and system delays, since the display is not complete until the final processing is done.

Any system time where a transaction request must wait in line for processing is called queue time.

Response time is the sum of the total service times (processing times) and queue times (wait times) as the transaction request travels across system components to the server and returns to deliver the final user display.

Response time (sec) = Service time (sec) + Queue time (sec)

"Warning: Queue time increases to infinity as any processing component of the system approaches full capacity."

Response time is important, since it directly contributes to user productivity.

Productivity = 60 sec/(response time + think time)

"Warning: As queue time increases response time will increase and productivity will decrease." 
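The response time and productivity relationships can be combined in a short Python sketch. The sample service, queue, and think times are hypothetical values chosen for illustration.

```python
def response_time(service_sec: float, queue_sec: float) -> float:
    """Response time (sec) = Service time (sec) + Queue time (sec)."""
    return service_sec + queue_sec


def productivity_dpm(response_sec: float, think_sec: float) -> float:
    """Productivity (displays per minute) = 60 sec / (response time + think time)."""
    return 60.0 / (response_sec + think_sec)


# Hypothetical user with 8 sec of think time per display.
light_load = response_time(service_sec=1.5, queue_sec=0.5)  # 2.0 sec
heavy_load = response_time(service_sec=1.5, queue_sec=2.5)  # 4.0 sec
print(productivity_dpm(light_load, think_sec=8.0))  # 6.0
print(productivity_dpm(heavy_load, think_sec=8.0))  # 5.0
```

In this example, two additional seconds of queue time reduce the user from 6 to 5 displays per minute.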

How to size the network
Figure 10.6 provides a chart showing the relationship between network utilization and response time. Performance models used to support network communications follow the same type of terms and relationships identified for server platforms.

Some of the same performance terms are referenced by different names.
 * Network transaction = display
 * Network throughput = traffic
 * Network capacity = bandwidth
 * Network utilization = utilization

The network connection (switch port, router port, network interface card, host bus adapter, etc.) is the hardware that processes the network traffic.
 * Most local networks are identified as single path systems.
 * Multiple network interface cards or multiple network paths can increase available throughput capacity.

Additional performance terms:
 * Network service time = network transport time
 * Network queue time = network congestion delays
 * Network latency delay time = measured latency (round trip travel time) x chatter (round trips)
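The network terms above map onto the same arithmetic used for platforms. The Python sketch below is illustrative only: the transport time calculation (traffic per display divided by bandwidth) is an assumption about how network service time would be computed, and all names and sample values are hypothetical.

```python
def network_utilization(traffic_mbps: float, bandwidth_mbps: float) -> float:
    """Utilization = throughput / capacity, which for a network is traffic / bandwidth."""
    return traffic_mbps / bandwidth_mbps


def transport_time_sec(display_traffic_mb: float, bandwidth_mbps: float) -> float:
    """Assumed network service time: megabits per display / bandwidth in Mbps."""
    return display_traffic_mb / bandwidth_mbps


def latency_delay_sec(round_trip_latency_sec: float, chatter_round_trips: int) -> float:
    """Latency delay = measured round trip latency x chatter (number of round trips)."""
    return round_trip_latency_sec * chatter_round_trips


# Hypothetical WAN link: 10 Mbps bandwidth carrying 4 Mbps of traffic,
# 2 Mb per display, 50 ms round trip latency, 10 round trips per display.
print(network_utilization(4.0, 10.0))   # 0.4
print(transport_time_sec(2.0, 10.0))    # 0.2
print(latency_delay_sec(0.05, 10))      # 0.5
```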

"Best practice: CPT includes network as additional system component when computing system performance."

"Warning: Network performance can be the most critical design constraint for many distributed system design solutions."

Platform queue time
Computing response time is a common problem for many business applications. To get it right, you have to understand queue time. The theory of queues or waiting in line has its origin in the work of A. K. Erlang, starting in 1909.

Figure 10.7 shows a formula for queue time and also a graph showing the relationship between queue time and platform utilization. The number of platform processor cores determines the sensitivity of queue time to platform utilization.

The simplest queuing models work for large populations of random arrival transactions, which should certainly be the case when modeling computer computations (thousands of random computer program instructions being executed within a relatively small period of time—e.g., seconds).

The queue time calculation used in the Capacity Planning Tool is a simplified model developed from Operations Research Queuing theory.
 * The second half of the model (single core section) is quite straightforward, and there is general agreement that this simple model would identify wait times in the case of a single service provider (single core platform or single network connection).
 * The multi-core case is a little more complicated, and unfortunately is the more common capacity planning calculation we need to deal with in multi-core server platform configurations.

Queue time model
The single-core platform queue time increases with increasing service time and platform utilization.

Queue time (single-core) = service time (sec) x utilization/(1 - utilization).

Queue time is zero (0) when utilization is zero (0) and increases to infinity as utilization approaches 100 percent.

In the multi-core platform case, it is important to include the probability of a processor core being available to service the request on arrival (not busy).
 * The more processor cores in the server, the more likely one of these cores will be available for processing when the service transaction arrives.
 * The equation simplifies to the simple single-core formula when the number of processor cores = 1.

Multi-core availability = 1/{1 + utilization x (cores - 1)}

Queue time = Multi-core availability x Queue time (single-core)
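The queue time formulas above translate directly into code. The following Python sketch uses hypothetical function names and expresses utilization as a fraction; it is an illustration of the formulas as written here, not the CPT implementation.

```python
def single_core_queue_time(service_time_sec: float, utilization: float) -> float:
    """Queue time (single-core) = service time x utilization / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be at least 0 and less than 1")
    return service_time_sec * utilization / (1.0 - utilization)


def multi_core_availability(utilization: float, cores: int) -> float:
    """Multi-core availability = 1 / (1 + utilization x (cores - 1))."""
    return 1.0 / (1.0 + utilization * (cores - 1))


def queue_time(service_time_sec: float, utilization: float, cores: int = 1) -> float:
    """Queue time = multi-core availability x single-core queue time."""
    return multi_core_availability(utilization, cores) * single_core_queue_time(
        service_time_sec, utilization
    )


# Hypothetical 1 sec service time at 50 percent utilization.
print(queue_time(1.0, 0.5, cores=1))  # 1.0
print(queue_time(1.0, 0.5, cores=4))  # 0.4
```

With one core the availability term equals 1, so the result reduces to the single-core formula.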


The derived queue time formula provided above has been compared against several benchmark test results, and the computed response time was reasonably close to the measured test results (shows conservative response times, slightly higher than measured values).

It is important to recognize that the accuracy of the queue time calculation impacts only the expected user response time, and does not reduce the accuracy of the platform capacity calculations provided by the earlier simple relationships.
 * For many years, Esri capacity planning models did not include estimates for user response time.
 * Workflow response time is important, since it directly impacts user productivity and workflow validity.
 * If display response times are too slow, the peak throughput estimates would not be achieved and the capacity estimates would not be conservative.

"Best practice: Including user response time in the capacity planning models provides more accurate and conservative platform specifications, and gives customers a better understanding of user performance and productivity."

Queue time derivatives

 * Display response time based on percent utilization

Multi-core servers provide better response times than single-core servers during heavy loads.
 * Eight 1-core servers provide 2-sec response time at 50 percent utilization.
 * Four 2-core servers provide 2-sec response time at 63 percent utilization.
 * Two 4-core servers provide 2-sec response time at 78 percent utilization.
 * One 8-core server provides 2-sec response time at 88 percent utilization.
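The four configurations above can be checked against the queue time model with a short Python sketch. The display service time is not stated for this comparison; a 1 sec service time is an assumption, chosen because it gives a 2 sec response for a single core at 50 percent utilization. Names are hypothetical.

```python
def response_time(service_time_sec: float, utilization: float, cores: int) -> float:
    """Response time = service time + multi-core queue time."""
    single_core_queue = service_time_sec * utilization / (1.0 - utilization)
    availability = 1.0 / (1.0 + utilization * (cores - 1))
    return service_time_sec + availability * single_core_queue


SERVICE_TIME_SEC = 1.0  # assumption, not stated in the source

# (cores per server, utilization) pairs from the list above.
CONFIGS = [(1, 0.50), (2, 0.63), (4, 0.78), (8, 0.88)]

for cores, utilization in CONFIGS:
    seconds = response_time(SERVICE_TIME_SEC, utilization, cores)
    print(f"{cores}-core at {utilization:.0%}: {seconds:.2f} sec")
```

Under that assumption, each configuration comes out within about 0.1 sec of the 2-sec response time quoted above.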

"Warning: More cores per server improves response times only when display service times are the same for all configurations."


 * CPT Design multi-core platform performance demonstration

What is system performance?
Figure 10.8 shows the information provided by the CPT Workflow Performance Summary. Workflow service times and queue times are shown in a stacked bar chart. Response time, shown at the height of the stack, is the total time required to complete the work transaction.

The Workflow Performance Summary chart shows the performance of 10 separate benchmark tests.
 * Tests were performed on 2-core servers.
 * The number of concurrent batch processes was increased with each test run.
 * Response time was about the same for the first two tests (1 and 2 batch processes).
 * Response time increased linearly for tests with more than 2 batch processes.

Response time includes all of the processing times and queue times experienced in completing an average work transaction. 
 * Platform service and queue times
 * Network transport and queue times
 * Latency travel time delays
 * Client service time

Server deployment transaction throughput capacity constraints
Several technology factors impact performance and scalability of deployed server systems. Selecting the optimum configuration strategy will help ensure peak system throughput and optimum return on investment. The following technology factors are important in developing an optimum ArcGIS deployment solution.

Virtual Server consolidation
Figure 10.9 shows a typical Enterprise GIS production environment supported by a physical server architecture.

For many years, data centers were supported by physical server configurations. With physical server deployment:
 * Many servers were required to support Enterprise operations.
 * Many servers were performing well below their optimum capacity.
 * The high number of servers contributed to high data center power consumption.

Figure 10.10 shows a typical Enterprise GIS production environment supported by a virtual server architecture. Virtualization reduces the total number of data center physical servers.

Virtual server machines are deployed on host server platforms. 
 * Multiple virtual machines can be supported by a single host server configuration.
 * Host platforms can run at optimum capacity levels (50 percent to 80 percent utilization).
 * Virtual Server architecture can be deployed to optimize host platform processing loads.

Virtualization: Host server processing loads
Figure 10.11 shows virtual server utilization capabilities on a host platform with host processor core shared with the hypervisor.

Virtual Server machines (VM) are deployed on a host platform, with access to processing resources controlled by a hypervisor. The hypervisor assigns VM virtual cores to host platform hardware CPU resources, allocating available processing resources between the deployed VMs. When host platform CPU resources are limited, the hypervisor must compete with the VM cores for access to available host platform resources.

Figure 10.12 shows virtual server utilization capabilities on a host platform with additional host processor resources available for hypervisor processing.

Hypervisor processing loads are supported directly by the host platform and can be serviced by available host CPU resources separate from the CPU resources assigned to Virtual Server machines (if extra CPU resources are available).

Test results show hypervisor loads may account for up to 35 percent of the total virtual server processing loads. The virtual cores for each VM must be assigned to available host platform physical cores for processing. Optimum VM throughput is achieved when sufficient host resources are available to support all VM processing requests along with the hypervisor processing load without having to compete for processing resources. As host platform utilization approaches 100 percent, the VM utilization will be limited based on available host resources.


 * CPT Design demonstration of ArcGIS Server Virtual Machine (VM) performance.

Available Virtual Server machine utilization and throughput are limited by hypervisor processing overhead when virtual servers must compete for available host platform processing resources.
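One simple way to apply the 35 percent hypervisor load noted above is to add it as headroom when sizing the host. The Python sketch below is a rough illustration under stated assumptions: it treats the overhead as a flat 35 percent on top of the VM core requirement, assumes host and virtual cores deliver equal per-core performance, and uses hypothetical names.

```python
import math

HYPERVISOR_OVERHEAD = 0.35  # up to 35 percent, per the test results cited above


def host_cores_required(
    vm_cores_required: float, overhead: float = HYPERVISOR_OVERHEAD
) -> int:
    """Host physical cores needed to carry the VM load plus hypervisor overhead."""
    return math.ceil(vm_cores_required * (1.0 + overhead))


# Hypothetical deployment: VMs that together need 8 cores of processing.
print(host_cores_required(8))  # 11
```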

"Best practice: Provide host platform with 35 percent more processing capacity than what is required by the virtual servers."

Esri/VMware joint benchmark testing reports:
 * October 2011 Esri ArcGIS Server 10 for VMware Infrastructure Deployment and Technical Considerations Guide includes performance testing of ArcGIS Server 10 with VMware ESXi 3.5u4.
 * July 2013 Esri ArcGIS 10.1+ for Server on VMware vSphere Deployment and Technical Considerations Guide includes performance testing of ArcGIS 10.1 for Server with VMware vSphere 5.1.

Test results show significant virtual server performance improvements with the more recent VMware vSphere technology. The October 2011 testing showed slightly more than 10 percent virtual server processing overhead per core, while the July 2013 testing showed limited performance degradation between physical and virtual server deployment configurations when the virtual host platform performs at levels less than 90 percent utilization.

"Note: July 2013 testing showed virtual server hypervisor overhead of 30 percent running on the host platform (50 percent of the VM loads)." 

Performance Validation
Planning provides the first opportunity for building successful GIS operations. Getting started right, understanding your business needs, understanding how to translate business needs to network and platform loads, and establishing a system design that will satisfy peak user workflow requirements is the first step on your road to success.

Planning is an important first step, but it is not enough to ensure success. If you want to deliver a project within the initial planning budget, you need to identify opportunities along the way to measure progress toward your implementation goal. Compliance with performance goals should be tracked throughout initial development, integration, and deployment, with performance validation measurements integrated along the way. Project success is achieved by tracking step-by-step progress toward your implementation goal, making appropriate adjustments along the way to deliver the final system within the planned project budget. The goal is to identify problems and provide solutions along the way; the earlier you identify a problem, the easier it will be to fix. System performance can be managed like any other project task. We showed how to address software performance in Chapter 3, network performance in Chapter 5, and platform performance in Chapter 7. If you don’t measure your progress as these pieces come together, you will miss the opportunity to identify and make the appropriate adjustments needed to ensure success.

Figure 10.13 shows three key opportunities for measuring performance compliance. When possible it is important to take advantage of opportunities throughout system development and deployment where you can measure progress toward meeting your performance goals. The CPT Test tab includes four tools you can use to translate live performance measurements to workflow service times – the workflow performance targets used to define your initial system design.

Map display render times
In Chapter 3 we shared the important factors that impact software performance. For Web mapping workflows, map complexity is the primary performance driver. Heavy map displays (lots of dynamic map layers and features included in each map extent) contribute to heavy server processing loads and network traffic. Simple maps generate lighter server loads and provide users with much quicker display performance. The first opportunity for building high performance map services is when you are authoring the map display.

There are two map rendering tools available on the CPT Test tab that use measured map rendering time to estimate equivalent workflow service times. One tool is available for translating ArcGIS Desktop map rendering times (MXD) and the other tool is for translating ArcGIS Server map service rendering times (MSD). With both tools, measured map rendering time is translated to workflow service times that can be used by the CPT Calculator and Design tabs for generating your platform solution. The idea is to validate that your map service will perform within your planned system budget by comparing the workflow service times generated from your measured rendering times with your initial workflow performance targets. If the service times exceed your planned budget, you should either adjust the map display complexity to perform within the initial planning budget or increase your system performance budget. The best time to make the map display complexity adjustment is during the map authoring process. Impacts on the project budget can be evaluated and proper adjustments made to ensure delivery success.

Map publishing preview render times

 * Measured MSD render time

MSD render time can be measured when publishing your map service using the service editor preview tool.

"Warning: Make sure to measure a map location that represents the average map complexity or higher within your service area extent."

MXDPerfStat render times

 * Measured MXD render time

MXD render time can be measured using the MXDperfstat ArcScript performance measurement tool.

"Warning: Make sure to measure a map location that represents the average map complexity or higher within your service area extent."

System test measured throughput and platform utilization

 * Measured throughput and platform utilization

If you know your platform configuration, your measured peak workflow throughput, and the associated platform utilization, the CPT can calculate the workflow service times. The Test tab translation tools can be used to input throughput (transactions per hour), the platform configuration (server platform selection), and the measured platform utilization, and Excel will translate these inputs to equivalent workflow service times.
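The arithmetic behind this translation follows from the capacity and service time formulas introduced earlier in the chapter. The Python sketch below is an approximation under stated assumptions, not the CPT's actual calculation: the CPT also presents service times relative to a platform performance baseline, a normalization this sketch does not attempt. The function name and sample values are hypothetical.

```python
def service_time_from_test(
    throughput_tph: float, cores: int, utilization: float
) -> float:
    """Translate measured throughput and utilization to a service time in seconds.

    throughput_tph is transactions per hour; utilization is a fraction (0 to 1].
    """
    throughput_dpm = throughput_tph / 60.0        # transactions per minute
    capacity_dpm = throughput_dpm / utilization   # Capacity = Throughput / Utilization
    return 60.0 * cores / capacity_dpm            # Service time = 60 x #core / Capacity


# Hypothetical test: 18,000 transactions per hour on a 4-core server at 60 percent.
print(round(service_time_from_test(18_000, 4, 0.60), 2))  # 0.48
```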

"Best practice: Performance metrics can be collected from benchmark test or live operations."

"Warning: Make sure all measurements are collected for the same loads at the same time."

System monitor concurrent users and platform utilization

 * Measured peak concurrent users and platform utilization translator

If you don’t have measured throughput, concurrent users working on the system can be used to estimate throughput loads. This is a valuable tool for using real business activity to validate system capacity (business units identify peak user loads and IT staff identify server utilization observed during these loads). The Test tab can be used to input throughput (peak concurrent users), the platform configuration (server platform selection), and the measured platform utilization, and Excel will translate these inputs to equivalent workflow service times.
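The concurrent user translation can be sketched the same way, with throughput estimated from peak users. The sketch assumes the web power user productivity of 6 DPM that this analysis uses, and adds a simple comparison against a planned service time target. All names, the target value, and the sample inputs are hypothetical, and platform baseline normalization is not included.

```python
WEB_POWER_USER_DPM = 6.0  # displays per minute per user assumed by the analysis


def throughput_from_users(
    peak_concurrent_users: int, productivity_dpm: float = WEB_POWER_USER_DPM
) -> float:
    """Estimate throughput (DPM) from peak concurrent users."""
    return peak_concurrent_users * productivity_dpm


def service_time_from_users(
    peak_concurrent_users: int, cores: int, utilization: float
) -> float:
    """Estimate service time (sec) from peak users and measured utilization."""
    throughput_dpm = throughput_from_users(peak_concurrent_users)
    return 60.0 * cores * utilization / throughput_dpm


def within_budget(measured_sec: float, target_sec: float) -> bool:
    """True when the measured service time meets the planned target."""
    return measured_sec <= target_sec


# Hypothetical observation: 50 peak users on a 4-core server at 60 percent utilization.
measured = service_time_from_users(50, cores=4, utilization=0.60)
print(round(measured, 2), within_budget(measured, target_sec=0.50))
```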

"Best practice: Analysis assumes peak users are working at web power user productivity (6 DPM) over a reasonable measurement period (10 minutes)."

"Warning: Make sure all measurements are collected for the same loads at the same time."


 * Move Test tab derived workflow service times to project workflows.

The CPT Workflow tab is where the results of your performance validation efforts come together. You can bring all your test results together, along with the original workflow service times, to validate that you are building a system that will perform and scale within your established project performance budget.

"Best practice: Performance management, including performance validation throughout development and system delivery, is the key to implementation success. It is important that you identify the right technology and establish reasonable performance goals during your initial system design planning. It is even more important that you monitor progress in meeting these goals throughout final system development and delivery." 

Capacity Planning
The models supporting Esri capacity planning today are based on the performance fundamentals introduced in this section. Platform capacity is determined by the software processing time (platform service time) and the number of platform cores, and is expressed in terms of peak displays per minute. Platform capacity (DPM) can be translated to supported concurrent users by dividing by the user productivity (DPM/client).
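As a closing illustration, the two steps described above (service time and cores to capacity, then capacity to concurrent users) can be sketched in Python. Function names and sample values are hypothetical, and the result is capacity at full utilization rather than a recommended operating level.

```python
def platform_capacity_dpm(cores: int, service_time_sec: float) -> float:
    """Capacity (DPM) = 60 sec x #core / Service time (sec)."""
    return 60.0 * cores / service_time_sec


def supported_concurrent_users(capacity_dpm: float, productivity_dpm: float) -> float:
    """Concurrent users = Capacity (DPM) / user productivity (DPM per client)."""
    return capacity_dpm / productivity_dpm


# Hypothetical 4-core platform, 0.48 sec service time, users working at 6 DPM.
capacity = platform_capacity_dpm(4, 0.48)
print(round(capacity), round(supported_concurrent_users(capacity, 6.0)))
```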

The performance fundamentals discussed in this chapter are basic concepts that apply to any computer environment, and an understanding of these fundamentals can establish a solid foundation for understanding system performance and scalability. Software and hardware technology will continue to change, and the terms and relationships identified in this section can be used to normalize these changes and help us understand what is required to support our system performance needs.

The next chapter will provide an overview of the Capacity Planning tools introduced throughout the previous chapters. The CPT videos at the end of this chapter focus on system performance validation – showing how the fundamental performance terms and relationships are used by the CPT to connect user requirements with system hardware loads, and how these loads are used to identify appropriate hardware requirements. Performance validation during system design and deployment is also a key topic, sharing how the CPT Test tools can be used to translate real performance measurements to equivalent workflow service times for performance validation.

CPT Capacity Planning videos

Previous Editions
Performance Management 43rd Edition

System Design Strategies 26th edition - An Esri ® Technical Reference Document • 2009 (final PDF release)