Modifiable areal unit problem

The modifiable areal unit problem (MAUP) is a source of statistical bias that is common in spatially aggregated data, that is, data grouped into regions or districts for which only summary statistics are reported. The bias is most severe where the chosen districts are poorly suited to the data. Apparent patterns (both spatial and statistical) in the data are then caused by the choice of district boundaries as much as by the real-world patterns in the phenomena being studied. MAUP is particularly problematic in spatial analysis and choropleth maps, in which aggregate spatial data are commonly used.

Background
The issue was discovered in 1934, but the term MAUP was first coined and described in detail by Openshaw, who lamented that "the areal units (zonal objects) used in many geographical studies are arbitrary, modifiable, and subject to the whims and fancies of whoever is doing, or did, the aggregating". The problem is especially apparent when aggregate data are used for cluster analysis in spatial epidemiology, spatial statistics or choropleth mapping, where misinterpretations can easily go unnoticed. Many fields of science, especially human geography, are prone to disregard the MAUP when drawing inferences from statistics based on aggregated data. MAUP is closely related to the topics of ecological fallacy and ecological bias.

Ecological bias caused by MAUP has been documented as two separate effects that usually occur simultaneously during the analysis of aggregated data. The scale effect causes variation in statistical results between different levels of aggregation. Therefore, the association between variables depends on the size of areal units for which data are reported. Generally, correlation increases as areal unit size increases. The zone effect describes variation in correlation statistics caused by the regrouping of data into different configurations at the same scale.
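The two effects can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from the literature cited here: it assumes a hypothetical 24-by-24 grid of point-level observations in which two variables share a smooth spatial signal and carry independent noise. The function names (`simulate`, `aggregate`, `pearson`) and all parameter values are illustrative assumptions. Square zones of increasing size illustrate the scale effect; rectangular zones of equal area but different shape illustrate the zone effect.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(size=24, seed=0):
    """Point-level x and y: a shared smooth surface plus independent noise (assumed model)."""
    rng = random.Random(seed)
    x, y = {}, {}
    for i in range(size):
        for j in range(size):
            signal = math.sin(i / 4) + math.cos(j / 5)
            x[i, j] = signal + rng.gauss(0, 1)
            y[i, j] = signal + rng.gauss(0, 1)
    return x, y


def aggregate(x, y, size, h, w):
    """Means of x and y within h-by-w rectangular zones that tile the grid."""
    xs, ys = [], []
    for r0 in range(0, size, h):
        for c0 in range(0, size, w):
            cells = [(i, j) for i in range(r0, r0 + h) for j in range(c0, c0 + w)]
            xs.append(sum(x[c] for c in cells) / len(cells))
            ys.append(sum(y[c] for c in cells) / len(cells))
    return xs, ys


SIZE = 24
x, y = simulate(SIZE)

# Scale effect: square zones of increasing size (1x1 is the point level).
scale_results = {k: pearson(*aggregate(x, y, SIZE, k, k)) for k in (1, 2, 4, 8)}

# Zone effect: zones of equal area (16 cells) but different shape.
zone_results = {
    (h, w): pearson(*aggregate(x, y, SIZE, h, w))
    for h, w in ((2, 8), (4, 4), (8, 2))
}

for k, r in scale_results.items():
    print(f"scale {k}x{k}: r = {r:.3f}")
for (h, w), r in zone_results.items():
    print(f"zone {h}x{w}: r = {r:.3f}")
```

In this assumed model, averaging within zones cancels much of the independent noise while largely preserving the shared signal, which is why the correlation of zone means tends to rise with zone size. Real data need not behave this way.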

Since the 1930s, research has found extra variation in statistical results because of the MAUP. The standard methods of calculating within-group and between-group variance do not account for the extra variance seen in MAUP studies as the groupings change. MAUP can be used as a methodology to calculate upper and lower limits as well as average regression parameters for multiple sets of spatial groupings.
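As a rough illustration of how the variance decomposition depends on the grouping, the sketch below splits the variance of one simulated variable into within-group and between-group parts under two different zonings of the same data. The data (a hypothetical 120-location transect with a linear trend plus noise), the two zonings, and the function name `variance_parts` are assumptions made for illustration. The interleaved zoning is deliberately artificial, chosen to make the contrast stark.

```python
import random


def variance_parts(values, groups):
    """Return (total, within, between) population variance for a given grouping."""
    n = len(values)
    grand_mean = sum(values) / n
    total = sum((v - grand_mean) ** 2 for v in values) / n

    # Collect the members of each group.
    members = {}
    for v, g in zip(values, groups):
        members.setdefault(g, []).append(v)

    within = 0.0
    between = 0.0
    for m in members.values():
        group_mean = sum(m) / len(m)
        within += sum((v - group_mean) ** 2 for v in m)
        between += len(m) * (group_mean - grand_mean) ** 2
    return total, within / n, between / n


rng = random.Random(1)
# Hypothetical transect: a linear spatial trend plus independent noise.
values = [0.05 * i + rng.gauss(0, 1) for i in range(120)]

# Two zonings with ten zones of twelve locations each.
contiguous = [i // 12 for i in range(120)]   # neighbours grouped together
interleaved = [i % 10 for i in range(120)]   # members scattered along the transect

for name, zoning in (("contiguous", contiguous), ("interleaved", interleaved)):
    total, within, between = variance_parts(values, zoning)
    print(f"{name}: total={total:.3f} within={within:.3f} "
          f"between={between:.3f} share={between / total:.2f}")
```

The total variance is identical in both cases and always equals the sum of the two parts, but the share attributed to differences between zones changes with the zoning. Any statistic computed from zone means inherits this dependence on the grouping.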

Suggested solutions
Several suggestions have been made in the literature to reduce aggregation bias during regression analysis. A researcher might correct the variance-covariance matrix using samples from individual-level data. Alternatively, one might focus on local spatial regression rather than global regression. A researcher might also attempt to design areal units to maximize a particular statistical result. Others have argued that it may be difficult to construct a single set of optimal aggregation units for multiple variables, each of which may exhibit non-stationarity and spatial autocorrelation across space in different ways. Others have suggested developing statistics that change across scales in a predictable way, perhaps using fractal dimension as a scale-independent measure of spatial relationships. Others have suggested Bayesian hierarchical models as a general methodology for combining aggregated and individual-level data for ecological inference.

Studies of the MAUP based on empirical data can only provide limited insight due to an inability to control relationships between multiple spatial variables. Data simulation is necessary to have control over various properties of individual-level data. Simulation studies have demonstrated that the spatial support of variables can affect the magnitude of ecological bias caused by spatial data aggregation.

MAUP sensitivity analysis
Using simulations for univariate data, Larsen advocated the use of a Variance Ratio to investigate the effect of spatial configuration, spatial association, and data aggregation. A detailed description of the variation of statistics due to MAUP is presented by Reynolds, who demonstrates the importance of the spatial arrangement and spatial autocorrelation of data values. Reynolds's simulation experiments were expanded by Swift in a series of nine exercises that began with simulated regression analysis and spatial trend, then focused on the topic of MAUP in the context of spatial epidemiology. A method of MAUP sensitivity analysis is presented that demonstrates that the MAUP is not entirely a problem: it can also be used as an analytical tool to help understand spatial heterogeneity and spatial autocorrelation.

This topic is of particular importance because in some cases data aggregation can obscure a strong correlation between variables, making the relationship appear weak or even negative. Conversely, MAUP can make unrelated random variables appear to have a significant association. Multivariate regression parameters are more sensitive to MAUP than correlation coefficients. Until a more analytical solution to MAUP is discovered, a spatial sensitivity analysis using a variety of areal units is recommended as a methodology to estimate the uncertainty of correlation and regression coefficients due to ecological bias. An example of data simulation and re-aggregation using the ArcPy library is available.
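A spatial sensitivity analysis of this kind can be sketched in a few lines. The example below is not the ArcPy example mentioned above. It is a minimal standard-library sketch under stated assumptions: a hypothetical 24-by-24 grid of simulated point data, rectangular zonings whose side lengths divide the grid evenly, and illustrative function names (`simulate_grid`, `zone_means`, `slope_and_r`, `sensitivity`, `summarize`). It re-aggregates the same data under each zoning, fits a simple regression of zone means, and reports the lowest, average, and highest slope and correlation across zonings.

```python
import math
import random
from itertools import product


def simulate_grid(size=24, seed=42):
    """Point-level x and y sharing a linear spatial gradient plus independent noise."""
    rng = random.Random(seed)
    x = [[0.0] * size for _ in range(size)]
    y = [[0.0] * size for _ in range(size)]
    for i, j in product(range(size), repeat=2):
        latent = 0.1 * i + 0.05 * j  # assumed gradient, for illustration only
        x[i][j] = latent + rng.gauss(0, 1)
        y[i][j] = latent + rng.gauss(0, 1)
    return x, y


def zone_means(grid, h, w):
    """Mean of each h-by-w rectangular zone; h and w must divide the grid size."""
    size = len(grid)
    return [
        sum(grid[i][j] for i in range(r, r + h) for j in range(c, c + w)) / (h * w)
        for r in range(0, size, h)
        for c in range(0, size, w)
    ]


def slope_and_r(xs, ys):
    """Ordinary least squares slope of ys on xs, and their Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


def sensitivity(x, y, sides=(2, 3, 4, 6, 8)):
    """Slope and correlation of zone means for every h-by-w zoning in `sides`."""
    return {
        (h, w): slope_and_r(zone_means(x, h, w), zone_means(y, h, w))
        for h, w in product(sides, repeat=2)
    }


def summarize(values):
    """Lower limit, average, and upper limit of a collection of estimates."""
    values = list(values)
    return {"min": min(values), "mean": sum(values) / len(values), "max": max(values)}


x, y = simulate_grid()
results = sensitivity(x, y)
slope_summary = summarize(s for s, _ in results.values())
r_summary = summarize(r for _, r in results.values())
print("slope:", slope_summary)
print("r:    ", r_summary)
```

The spread between the minimum and maximum gives a rough indication of how much the estimates depend on the choice of areal units. Regular rectangular zones are a simplification; an analysis of real data would typically use administrative or irregular zones.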

Examples
Census data may be aggregated into census enumeration districts, postcode areas, police precincts, or any other spatial partition (thus, the 'areal units' are 'modifiable'). Variation in the spatial units used for aggregation causes variation in statistical results.

Another example involving a census would be to find a different correlation between two variables depending on whether they are analyzed at the census block level or at the larger census tract level. Typically, coarser levels of aggregation make attributes seem more homogeneous, so, consistent with the scale effect, the correlation is generally higher at the tract level.