Shapefile

The Esri shapefile, or simply a shapefile, is a popular geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a public specification for data interoperability among Esri and other software products. A "shapefile" is not a single file but a collection of files with ".shp", ".shx", ".dbf", and other extensions that share a common prefix name (for example, "lakes.shp", "lakes.shx", and "lakes.dbf"). Strictly speaking, the term refers to the file with the ".shp" extension; however, this file alone is incomplete for distribution, because the other supporting files are required.

Shapefiles spatially describe geometries: points, polylines, and polygons. These could represent, for example, water wells, rivers, and lakes, respectively. Each item may also have attributes that describe it, such as its name or temperature.

Overview
A shapefile is a digital vector format that stores geometric locations and their associated attribute information. The format cannot store topological information. It was introduced with ArcView GIS version 2 in the early 1990s. Shapefiles can now be read and written by a variety of free and non-free programs.

Shapefiles are simple because they store only primitive geometric data types: points, lines, and polygons. These primitives are of limited use without attributes that specify what they represent, so a table of records stores the properties (attributes) of each primitive shape in the shapefile. Combining shapes with data attributes allows a wide range of geographic data to be represented, and those representations can in turn support spatial computations.

While the term "shapefile" is quite common, a "shapefile" is actually a set of several files. Three individual files are mandatory to store the core data that comprise a shapefile. Several further optional files store primarily index data to improve performance. Each file should conform to the MS-DOS 8.3 filename convention (an 8-character filename prefix, a full stop, and a 3-character filename suffix, such as "lakes.shp") in order to remain compatible with older applications that handle shapefiles. For the same reason, all files should be located in the same folder.

Mandatory files:

 * .shp: shape format; the feature geometry itself
 * .shx: shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly
 * .dbf: attribute format; columnar attributes for each shape, in dBase III format

Optional files:

 * .prj: projection format; the coordinate system and projection information, a plain text file describing the projection using well-known text format
 * .sbn and .sbx: a spatial index of the features
 * .fbn and .fbx: a spatial index of the features for shapefiles that are read-only
 * .ain and .aih: an attribute index of the active fields in a table or a theme's attribute table
 * .ixs: a geocoding index for read-write shapefiles
 * .mxs: a geocoding index for read-write shapefiles (ODB format)
 * .atx: an attribute index for the .dbf file in the form of shapefile.columnname.atx (ArcGIS 8 and later)
 * .shp.xml: metadata in XML format
 * .cpg: used to specify the code page (only for .dbf) for identifying the character encoding to be used

In the .shp, .shx, and .dbf files, the shapes correspond to each other in sequence. That is, the first record in the .shp file corresponds to the first record in the .shx and .dbf files, and so on. The .shp and .shx files contain fields of differing endianness, so an implementor of the file formats must take care to read each field with its correct byte order.
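As a small illustration of why byte order matters, the sketch below packs the value 9994 (the file code that, per the Esri technical description, opens the main file header as a big-endian integer) and reads it back with both byte orders. It uses only Python's standard `struct` module; the variable names are illustrative.

```python
import struct

# The file code 9994 is stored big-endian at the start of the main file.
raw = struct.pack(">i", 9994)

big = struct.unpack(">i", raw)[0]     # correct byte order for this field
little = struct.unpack("<i", raw)[0]  # wrong byte order for this field

print(raw.hex(), big, little)  # 0000270a 9994 170328064
```

Reading the field with the wrong byte order does not fail; it silently yields a different number, which is why each field has to be handled individually.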

Shapefiles express coordinates in terms of X and Y, although these often hold longitude and latitude, respectively. When working with X and Y, be sure to respect their order: longitude is stored in X and latitude in Y.

Shapefile shape format
The main file (.shp) contains the primary geographic reference data in the shapefile. The file consists of a single fixed-length header followed by one or more variable-length records. Each variable-length record includes a record header component and a record contents component. A detailed description of the file format is given in the Esri Shapefile Technical Description. This format should not be confused with the AutoCAD shape font source format, which shares the .shp extension.

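The main file header is fixed at 100 bytes in length and contains 17 fields: nine 4-byte (32-bit signed integer, or int32) fields followed by eight 8-byte (double) signed floating-point fields:

 * Bytes 0 to 3: file code (int32, big-endian), always 9994
 * Bytes 4 to 23: five unused int32 fields (big-endian)
 * Bytes 24 to 27: file length in 16-bit words, including the header (int32, big-endian)
 * Bytes 28 to 31: version (int32, little-endian), 1000
 * Bytes 32 to 35: shape type (int32, little-endian)
 * Bytes 36 to 67: minimum bounding rectangle as four doubles (little-endian): min X, min Y, max X, max Y
 * Bytes 68 to 83: range of Z as two doubles (little-endian): min Z, max Z
 * Bytes 84 to 99: range of M as two doubles (little-endian): min M, max M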
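The following is a minimal sketch of how the 100-byte header could be decoded with Python's standard `struct` module, assuming the field layout given in the Esri Shapefile Technical Description (seven big-endian integers, then two little-endian integers and eight little-endian doubles). The `ShpHeader` type and `parse_header` function are hypothetical names chosen for this example, and the sample bytes are synthetic.

```python
import struct
from typing import NamedTuple, Tuple


class ShpHeader(NamedTuple):
    file_code: int
    file_length_bytes: int
    version: int
    shape_type: int
    bbox: Tuple[float, float, float, float]  # min X, min Y, max X, max Y
    z_range: Tuple[float, float]
    m_range: Tuple[float, float]


def parse_header(buf: bytes) -> ShpHeader:
    if len(buf) < 100:
        raise ValueError("a shapefile header is 100 bytes long")
    # Big-endian part: file code, five unused fields, length in 16-bit words.
    file_code, *_unused, length_words = struct.unpack(">7i", buf[:28])
    # Little-endian part: version, shape type, then eight doubles.
    version, shape_type, *d = struct.unpack("<2i8d", buf[28:100])
    return ShpHeader(file_code, 2 * length_words, version, shape_type,
                     (d[0], d[1], d[2], d[3]), (d[4], d[5]), (d[6], d[7]))


# Synthetic header for an empty Point shapefile (length = 50 words = 100 bytes).
sample = (struct.pack(">7i", 9994, 0, 0, 0, 0, 0, 50)
          + struct.pack("<2i8d", 1000, 1, -10.0, -5.0, 10.0, 5.0, 0, 0, 0, 0))
header = parse_header(sample)
print(header.file_code, header.file_length_bytes, header.shape_type)
```

The length field counts 16-bit words, so the sketch doubles it to obtain a size in bytes.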

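The file then contains any number of variable-length records. Each record is prefixed with a record header of 8 bytes, made up of two big-endian int32 fields: the record number (starting at 1) and the content length of the record in 16-bit words.

Following the record header is the actual record, which begins with a little-endian int32 shape type and continues with the shape-specific content.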

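The variable-length record contents depend on the shape type. The possible shape types, with their numeric codes, are:

 * 0: Null shape
 * 1: Point
 * 3: Polyline
 * 5: Polygon
 * 8: MultiPoint
 * 11: PointZ
 * 13: PolylineZ
 * 15: PolygonZ
 * 18: MultiPointZ
 * 21: PointM
 * 23: PolylineM
 * 25: PolygonM
 * 28: MultiPointM
 * 31: MultiPatch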

In practice, shapefiles containing Point, Polyline, or Polygon shapes are by far the most common. The "Z" types are three-dimensional. The "M" types contain a user-defined measurement associated with each point. Three-dimensional shapefiles are rather uncommon, and the measurement functionality has been largely superseded by more robust databases used in conjunction with the shapefile data.
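The record structure of the main file can be walked with a short loop. The sketch below assumes the layout from the Esri technical description: an 8-byte big-endian record header (record number, content length in 16-bit words) followed by content that starts with a little-endian shape type. `iter_records` is a hypothetical helper, and the sample data is a synthetic file with a zeroed header and two Point records (shape type 1, stored as two little-endian doubles).

```python
import struct
from typing import Iterator, Tuple


def iter_records(shp: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (record_number, shape_type, shape_bytes) for each record."""
    pos = 100  # skip the fixed-length file header
    while pos + 8 <= len(shp):
        number, length_words = struct.unpack(">2i", shp[pos:pos + 8])
        start = pos + 8
        end = start + 2 * length_words  # content length is in 16-bit words
        content = shp[start:end]
        if len(content) < 4:
            raise ValueError(f"record {number} is truncated")
        shape_type = struct.unpack("<i", content[:4])[0]
        yield number, shape_type, content[4:]
        pos = end


def point_record(number: int, x: float, y: float) -> bytes:
    content = struct.pack("<i2d", 1, x, y)  # 20 bytes = 10 words
    return struct.pack(">2i", number, len(content) // 2) + content


sample_shp = bytes(100) + point_record(1, 2.5, 48.0) + point_record(2, -3.0, 40.5)
for number, shape_type, body in iter_records(sample_shp):
    print(number, shape_type, struct.unpack("<2d", body))
```

Only Point content is decoded here; other shape types would need their own parsing of `shape_bytes`.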

Shapefile shape index format
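The shapefile index (.shx) contains the same 100-byte header as the .shp file, followed by any number of 8-byte fixed-length records. Each record consists of two big-endian int32 fields: the offset of the corresponding record in the .shp file and the content length of that record, both measured in 16-bit words.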

Using this index, it is possible to seek backwards in the shapefile by first seeking backwards in the shape index (which is possible because it uses fixed-length records), reading the record offset, and using that offset to seek to the correct position in the .shp file. The same method makes it possible to seek forwards by an arbitrary number of records.
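Because every index record is 8 bytes long, the entry for any record can be located by arithmetic alone. The sketch below assumes the index layout from the Esri technical description (offset and content length as big-endian integers, both in 16-bit words); `record_location` is a hypothetical helper and the index bytes are synthetic.

```python
import struct


def record_location(shx: bytes, index: int) -> tuple:
    """Return (byte_offset, content_bytes) in the .shp file for a 0-based record index."""
    pos = 100 + 8 * index  # fixed-length entries follow the 100-byte header
    if index < 0 or pos + 8 > len(shx):
        raise IndexError("record index out of range")
    offset_words, length_words = struct.unpack(">2i", shx[pos:pos + 8])
    return 2 * offset_words, 2 * length_words


# Synthetic index: a zeroed 100-byte header plus entries for two Point records.
sample_shx = bytes(100) + struct.pack(">2i", 50, 10) + struct.pack(">2i", 64, 10)
offset, length = record_location(sample_shx, 1)
print(offset, length)  # 128 20
```

The returned offset points at the record header in the .shp file, so the record content itself begins 8 bytes later.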

Shapefile attribute format
Attributes for each shape are stored in the dBase format. An alternative format that can also be used is the xBase format, which has an open specification and is used in open-source shapefile libraries such as the Shapefile C Library. The attributes can be viewed by opening the .dbf file in Microsoft Excel, but editing the file in Excel can corrupt the shapefile. Nevertheless, users often open .dbf files in Excel and save them as spreadsheets in order to view and distribute shapefile attributes outside of GIS software.

Shapefile projection format
The projection information contained in the .prj file is critical for interpreting the data in the .shp file correctly. Although the file is technically optional, it is usually provided, because the projection of a given set of points cannot necessarily be guessed. The file is stored in well-known text (WKT) format.

Some typical information contained in the .prj file is:


 * Geographic coordinate system
 * Datum (geodesy)
 * Spheroid
 * Prime meridian
 * Map projection
 * Units used
 * Parameters necessary to use the map projection, for example:
    * Latitude of origin
    * Scale factor
    * Central meridian
    * False northing
    * False easting
    * Standard parallels
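To make the list above concrete, the sketch below holds the kind of WKT text commonly found in a .prj file for unprojected WGS 84 coordinates and pulls out the named components with a regular expression. The `named_components` helper is hypothetical, and the regular expression is only a rough illustration, not a WKT parser.

```python
import re

# WKT as commonly written to a .prj file for geographic WGS 84 coordinates.
PRJ_TEXT = (
    'GEOGCS["GCS_WGS_1984",'
    'DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],'
    'PRIMEM["Greenwich",0.0],'
    'UNIT["Degree",0.0174532925199433]]'
)


def named_components(wkt: str) -> list:
    """Return (keyword, name) pairs such as ('DATUM', 'D_WGS_1984')."""
    return re.findall(r'([A-Z]+)\["([^"]*)"', wkt)


for keyword, name in named_components(PRJ_TEXT):
    print(keyword, name)
```

A projected coordinate system would wrap this text in an outer element that also carries the projection name and its parameters.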

Shapefile spatial index format
This is a binary spatial index file, which is used only by Esri software. The format is not documented and is not implemented by other vendors. The .sbn file is not strictly necessary, since the .shp file contains all of the information needed to parse the spatial data.

Topology and shapefiles
Shapefiles do not have the ability to store topological information. ArcInfo coverages and Personal/File/Enterprise Geodatabases do have the ability to store feature topology.

Spatial representation
The edges of a polyline or polygon are defined using points. The spacing of the points implicitly determines the scale for which the data are useful. Exceeding that scale results in jagged representation of features. Additional points would be required to achieve smooth shapes at greater scales. For features better represented by smooth curves, the polygon representation requires much more data storage than, for example, splines, which can capture smoothly varying shapes efficiently. None of the shapefile types supports splines.

Data storage
Neither the .shp nor the .dbf component file can exceed 2 GB (2^31 bytes). This translates to, at best, about 70 million point features. The maximum number of features for other geometry types varies with the number of vertices used.
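The point-feature figure can be checked with simple arithmetic, assuming each Point record occupies 28 bytes in the .shp file (an 8-byte record header, a 4-byte shape type, and two 8-byte coordinates) after the 100-byte file header. The constant names are illustrative.

```python
MAX_FILE_BYTES = 2**31          # 2 GB limit on the .shp file
FILE_HEADER_BYTES = 100
RECORD_HEADER_BYTES = 8
POINT_CONTENT_BYTES = 4 + 8 + 8  # shape type + X + Y

max_points = (MAX_FILE_BYTES - FILE_HEADER_BYTES) // (
    RECORD_HEADER_BYTES + POINT_CONTENT_BYTES
)
print(max_points)  # 76695841
```

This upper bound of roughly 76.7 million considers only the .shp file; the attribute table in the .dbf file is subject to the same 2 GB limit, so wide attribute records can lower the practical maximum.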

The attribute database format of the .dbf component file is based on an older dBase standard. This database format has many inherent limitations, including:
 * Incapable of storing null values (this is a serious issue for quantitative data, as it may skew representation and statistics as null quantities are often represented with 0)
 * Poor support for Unicode field names or field storage
 * Maximum length of field names is 10 characters
 * Maximum number of fields is 255
 * Supported field types are: floating point (13 character storage), integer (4 or 9 character storage), date (no time storage; 8 character storage), and text (maximum 254 character storage)
 * Floating point numbers may contain rounding errors since they are stored as text
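The effect of storing numbers as text can be shown with a short round trip. The sketch below formats a value into a 13-character field and reads it back; the choice of 6 decimal places and the `to_dbf_numeric` helper are assumptions made for the example, not part of any specification.

```python
def to_dbf_numeric(value: float, width: int = 13, decimals: int = 6) -> str:
    """Format a number as right-aligned text of fixed width."""
    text = f"{value:{width}.{decimals}f}"
    if len(text) > width:
        raise ValueError("value does not fit in the field")
    return text


value = 1 / 3
stored = to_dbf_numeric(value)
restored = float(stored)

print(repr(stored))       # '     0.333333'
print(restored == value)  # False
```

The value read back differs from the original because only the digits that fit in the text field survive.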

Mixing shape types
Because the shape type precedes each record, a shapefile is physically capable of storing a mixture of different shape types. However, the specification states, "All the non-Null shapes in a shapefile are required to be of the same shape type." The ability to mix shape types is therefore limited to interspersing null shapes with the single shape type declared in the file's header. A shapefile must not contain both Polyline and Polygon data, for example, and the descriptions of a well (point), a river (polyline), and a lake (polygon) would be stored in three separate files.
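A reader could enforce this rule with a check along the following lines. This is a minimal sketch: `check_shape_types` is a hypothetical helper that takes the shape type from the file header and the sequence of per-record shape types, using the numeric codes from the Esri technical description (0 for Null, 3 for Polyline, 5 for Polygon).

```python
NULL_SHAPE = 0


def check_shape_types(header_type: int, record_types: list) -> list:
    """Return the 1-based record numbers whose type violates the rule."""
    return [
        number
        for number, shape_type in enumerate(record_types, start=1)
        if shape_type not in (NULL_SHAPE, header_type)
    ]


# A Polygon (5) file containing a null shape and one stray Polyline (3).
violations = check_shape_types(5, [5, 0, 5, 3])
print(violations)  # [4]
```

Null shapes pass the check, while the Polyline record in a Polygon file is reported.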

Note for the future
Although the shapefile has become commonplace in the geospatial market and was adopted as the default interchange format for GIS data, its limitations are significant. It alters the precision of data translated into it from a more precise format (unless specific steps are taken), and it does not support lookup tables in domain format. As noted earlier, it does not recognize topology, which almost disqualifies it by definition from being a true GIS format. Its fields are based on dBase and have significant type and width limitations.

If given a choice between a file-based geodatabase (FBGDB) and a shapefile, the geodatabase should almost always be chosen. The shapefile is an antiquated format now that the geodatabase format has been modernized and fully adopted. If Esri opens the source code for FBGDB, shapefiles will become fully antiquated.