Standard Generalized Markup Language

The Standard Generalized Markup Language (ISO 8879:1986 SGML) is an ISO-standard technology for defining generalized markup languages for documents. ISO 8879 Annex A.1 defines generalized markup:

Generalized markup is based on two novel postulates:
 * Markup should describe a document's structure and other attributes, rather than specify the processing to be performed on it, as descriptive markup need be done only once, and will suffice for future processing.
 * Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and data bases, can be used for processing documents as well.

Standard versions
SGML is an ISO standard: "ISO 8879:1986 Information processing — Text and office systems — Standard Generalized Markup Language (SGML)", of which there are three versions:


 * Original SGML, which was accepted in October 1986, followed by a minor Technical Corrigendum.
 * SGML (ENR), in 1996, resulted from a Technical Corrigendum to add extended naming rules allowing arbitrary-language and -script markup.
 * SGML (ENR+WWW or WebSGML), in 1998, resulted from a Technical Corrigendum to better support XML and WWW requirements.

SGML is part of a trio of enabling ISO standards for electronic documents:


 * SGML (ISO 8879) - generalized markup language. SGML was reworked in 1998 into XML, a successful profile of SGML. Today, full SGML is rarely used in new projects.
 * DSSSL (ISO/IEC 10179) - document processing and styling language based on Scheme. DSSSL was reworked into W3C XSLT and XSL-FO, which use an XML syntax. Today, DSSSL is rarely used in new projects apart from Linux documentation.
 * HyTime (ISO/IEC 10744) - generalized hypertext and scheduling. HyTime was partially reworked into W3C XLink. Today, HyTime is rarely used in new projects.

SGML is supported by various technical reports, in particular


 * ISO/IEC TR 9573 - Information processing - SGML support facilities - Techniques for using SGML
 * Part 13: Public entity sets for mathematics and science
 * In 2007, the W3C MathML working group agreed to take over maintenance of these entity sets.

Origins
SGML descended from IBM's Generalized Markup Language (GML), which Charles Goldfarb, Edward Mosher, and Raymond Lorie developed in the 1960s. Goldfarb, editor of the international standard, coined the term “GML” from their surname initials. As a document markup language, SGML was originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry. Many of these documents must remain readable for several decades, a long time in the information technology field. SGML was also applied extensively by the military and by the aerospace, technical reference, and industrial publishing businesses. The advent of the XML profile has made SGML suitable for widespread application for small-scale, general-purpose use.



Validity
SGML (ENR+WWW) defines two kinds of validity. According to the revised Terms and Definitions of ISO 8879 (from the public draft):

A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both. Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references.

A type-valid SGML document is defined by the standard as

An SGML document in which, for each document instance, there is an associated document type declaration to whose DTD that instance conforms.

A tag-valid SGML document is defined by the standard as

An SGML document, all of whose document instances are fully-tagged. There need not be a document type declaration associated with any of the instances. Note: If there is a document type declaration, the instance can be parsed with or without reference to it.

NOTE: Tag-validity was introduced in SGML (ENR+WWW) to support XML, which allows documents that have no DOCTYPE declaration but can be parsed without a grammar (the standard calls this fully-tagged), as well as standalone documents, which have a DOCTYPE declaration that makes no XML Infoset contributions to the document. The integrally-stored constraint supports the XML requirement that elements end in the same entity in which they started. The reference-free constraint supports the HTML requirement that entity references are for special characters and do not contain markup. Commentary on SGML validity, especially commentary that was made before 1997 or that is unaware of SGML (ENR+WWW), covers type-validity only.

The SGML emphasis on validity supports the requirement for generalized markup that markup should be rigorous. (ISO 8879 A.1)
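As a rough illustration of what "fully-tagged" means in practice, the following Python sketch checks that every start tag has an explicit, properly nested end tag. It is a toy under stated assumptions, not the standard's definition: it only recognizes simple tags in the reference concrete syntax, it ignores declarations, comments, and empty elements, and the function name is_fully_tagged is invented for this example.

```python
import re

# Simple start tags (<p>, <p id=x>) and end tags (</p>) in the
# reference concrete syntax. Declarations, comments, and marked
# sections are not handled by this sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")


def is_fully_tagged(text):
    """Return True if every start tag has an explicit, nested end tag."""
    open_elements = []
    for match in TAG.finditer(text):
        is_end = match.group(1) == "/"
        name = match.group(2).lower()  # names are case-insensitive here
        if not is_end:
            open_elements.append(name)
        elif not open_elements or open_elements.pop() != name:
            return False  # end tag does not close the current element
    return not open_elements  # anything still open was never closed
```

A document that passes such a check can be parsed into an element structure without consulting a DTD, which is the property that tag-validity relies on.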

Syntax
An SGML document may have three parts:


 * the SGML Declaration,
 * the Prolog, containing a DOCTYPE declaration with the various markup declarations that together make a Document Type Definition (DTD),
 * the instance itself, containing the top-most element and its contents.

An SGML document may be composed of many entities (discrete physical parts). In SGML, the entity and element types used in the document may be specified with a DTD, while the character sets, features, delimiter sets, and keywords are specified in the SGML Declaration to create the concrete syntax of the document.

The XML specification (s4.3.1) puts it the following way, though note that full SGML allows implicit markup and some other kinds of tags.

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.

For introductory information on basic, modern SGML syntax, see the XML entry. The following material concentrates on features that are additional to those found in XML and is not a comprehensive summary of SGML syntax.

Optional features
SGML was developed to generalize and support a wide range of markup languages as found in the mid 1980s. These ranged from Wiki-like terse syntaxes to RTF-like bracketed languages to HTML-like matching-tag languages. SGML did this with a relatively simple default reference concrete syntax, augmented with a large number of optional features that could be enabled in the SGML Declaration. Thus not every SGML parser could necessarily process every SGML document. However, because the processor's System Declaration could be compared to the document's SGML Declaration, it was always possible to know whether a document was supported by a particular processor.

Many of the features relate to markup minimization. Other features relate to parallel asynchronous markup (CONCUR), to linking processing attributes (LINK), and to embedding SGML documents directly within SGML documents (SUBDOC).

The notion of customizable features was not appropriate for use over the WWW, so a goal of XML was to reduce optional features to a minimum. However, because of its well-formedness rules, XML cannot support the declaration and parsing of Wiki-like languages, leaving them unstandardized and difficult to integrate with non-text information systems.

Concrete and abstract syntaxes
The usual (default) SGML concrete syntax resembles this example, which is the default HTML concrete syntax:

SGML provides an abstract syntax that can be implemented in many different types of concrete syntax. Although the norm is to use angle brackets as start- and end-tag delimiters in an SGML document (per the standard-defined reference concrete syntax), it is possible to use other characters, provided a suitable concrete syntax is defined in the document's SGML Declaration. For example, an SGML interpreter might be programmed to parse GML markup, wherein tags are delimited with a left colon and a right full stop, and an :e prefix denotes an end tag. According to the reference syntax, letter case (upper or lower) is not distinguished in tag names, so tag names that differ only in case are equivalent. (NOTE: A concrete syntax might change this rule via the NAMECASE NAMING declarations.)
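The separation of abstract and concrete syntax can be sketched in Python by making the tag delimiters a parameter of the tokenizer. This is only an illustration: the ConcreteSyntax class, the tokenize function, and the two syntax objects are hypothetical names, and a real SGML Declaration configures far more than three delimiters.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcreteSyntax:
    stago: str              # start-tag open delimiter
    etago: str              # end-tag open delimiter
    tagc: str               # tag close delimiter
    fold_case: bool = True  # treat names case-insensitively


REFERENCE = ConcreteSyntax(stago="<", etago="</", tagc=">")
GML_LIKE = ConcreteSyntax(stago=":", etago=":e", tagc=".")


def tokenize(text, syntax):
    """Split text into ("start", name), ("end", name), ("data", text)."""
    # The end-tag delimiter is tried first because, in both syntaxes
    # above, it begins with the start-tag delimiter.
    tag = re.compile(
        "(?P<open>%s|%s)(?P<name>[A-Za-z][A-Za-z0-9]*)%s"
        % (re.escape(syntax.etago), re.escape(syntax.stago),
           re.escape(syntax.tagc))
    )
    tokens, pos = [], 0
    for match in tag.finditer(text):
        if match.start() > pos:
            tokens.append(("data", text[pos:match.start()]))
        name = match.group("name")
        if syntax.fold_case:
            name = name.upper()
        kind = "end" if match.group("open") == syntax.etago else "start"
        tokens.append((kind, name))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens
```

With these definitions, tokenize(":p.Hi:ep.", GML_LIKE) and tokenize("<P>Hi</p>", REFERENCE) produce the same token list, which is the sense in which both concrete syntaxes realize one abstract syntax. Note that this toy GML-like syntax would misread a start tag whose name begins with "e" as an end tag.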

Markup Minimization
SGML has features for reducing the number of characters required to mark up a document, which must be enabled in the SGML Declaration. SGML software need not support every available feature. These features allow SGML applications to tolerate many types of inadvertent markup omissions; however, SGML systems usually are intolerant of invalid element structures, whereas XML is intolerant of syntax omissions and does not require a DTD for validation.

OMITTAG
The DTD specifies whether an element's start or end tags may be omitted; SGML's rules for implying omitted tags constitute the OMITTAG feature. Whether a tag must be paired or can occur singly (as some HTML tags do) is defined in the DTD for the markup language being defined (provided the OMITTAG feature is enabled). In this case, the XML counterpart would be the specific empty tag, equivalent to the SGML NET-enabling start-tag, introduced in the TC2 (International Standard ISO 8879:1986, Technical Corrigendum 2, November 1999).

SHORTREF
Tags can be replaced with delimiter strings for terser markup; this is the SHORTREF feature. This markup style is now associated with Wiki markup, e.g. wherein two equals signs (==) at the start of a line are the “heading start-tag”, and two equals signs (==) after that are the “heading end-tag”.
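The Wiki-style heading described above can be imitated with a small Python sketch that rewrites the delimiter strings into explicit tags. This is an analogy rather than an implementation of SHORTREF: real short references are declared in the DTD and mapped to entities, whereas the function name expand_headings and the element name h2 here are invented for illustration.

```python
import re

# "==" at the start of a line opens a heading; the next "==" on the
# same line closes it. Trailing spaces or tabs are tolerated.
HEADING = re.compile(r"^==(.+?)==[ \t]*$", re.MULTILINE)


def expand_headings(text, element="h2"):
    """Replace ==Heading== lines with explicit start and end tags."""
    def to_tags(match):
        title = match.group(1).strip()
        return "<%s>%s</%s>" % (element, title, element)

    return HEADING.sub(to_tags, text)
```

For example, expand_headings("==Origins==\nSome text.") yields the heading wrapped in explicit tags, followed by the unchanged text line.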

SHORTTAG
SGML markup languages whose concrete syntax enables the SHORTTAG VALUE feature do not require attribute values containing only alphanumeric characters to be enclosed within quotation marks, either double " (LIT) or single ' (LITA), so that the previous markup example could be written:

One feature of SGML markup languages is "presumptuous empty tagging": an empty end tag (written </> in the reference concrete syntax) "inherits" its element name from the nearest previous full start tag. In other words, it closes the most recently opened element.
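The behavior of the empty end tag can be sketched in Python with a stack of open elements. This is a minimal illustration assuming the reference concrete syntax and attribute-free tags; the function name expand_empty_end_tags is hypothetical.

```python
import re

# Matches <name>, </name>, and the empty forms <> and </>.
# Tags with attributes are not handled by this sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)?>")


def expand_empty_end_tags(text):
    """Rewrite each empty end tag as a full end tag for the open element."""
    open_elements = []

    def replace(match):
        is_end, name = match.group(1) == "/", match.group(2)
        if not is_end:
            if name:  # ordinary start tag
                open_elements.append(name)
            return match.group(0)  # an empty start tag is left as-is
        if not open_elements:
            raise ValueError("end tag with no open element")
        current = open_elements.pop()
        # A full end tag is kept; an empty one takes the current name.
        return match.group(0) if name else "</%s>" % current

    return TAG.sub(replace, text)
```

Applied to "<p>Some <em>text</> here</>", the sketch closes em first and then p, because each empty end tag refers to whichever element was opened most recently.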

NET
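Another feature is the NET (Null End Tag) construction, in which a start tag is closed with a slash instead of a closing angle bracket and the element is ended by the next slash. The result is structurally equivalent to a full start tag, the content, and a full end tag.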
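A NET construction can be expanded into ordinary tags with a single substitution, as in the following Python sketch. It assumes the reference concrete syntax, attribute-free start tags, and content that contains neither a slash nor nested markup; the function name expand_net is invented for this example.

```python
import re

# <name/content/ : the first slash closes the start tag and enables
# the null end tag; the second slash is the null end tag itself.
NET = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/")


def expand_net(text):
    """Rewrite <name/content/ as <name>content</name>."""
    return NET.sub(r"<\1>\2</\1>", text)
```

For instance, expand_net("A <em/short/ form") produces the same element written with a full start tag and end tag.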

Other features
Additionally, the SHORTTAG NETENABL IMMEDNET feature allows shortening tags surrounding an empty text value, but forbids shortening full tags:

can be written as:

NOTE: In XML, the NET delimiter is > (angled bracket), hence this construct appears in XML as an empty-element tag ending in />. The third feature is 'text on the same line', which allows a markup item to be ended with a line-end; it is especially useful for headings and the like, and requires either SHORTREF or DATATAG minimization. For example, if the DTD includes the following declarations:

(and "&#RE;&#RS;" is a short-reference delimiter in the concrete syntax), then:

is equivalent to:

Formal characterization
SGML has many features that defied convenient description with the popular formal automata theory and the contemporary parser technology of the 1980s and the 1990s. The standard warns in Annex H:

The SGML model group notation was deliberately designed to resemble the regular expression notation of automata theory, because automata theory provides a theoretical foundation for some aspects of the notion of conformance to a content model. No assumption should be made about the general applicability of automata to content models.

There appears to be no definitive classification of full SGML against a known class of formal grammar. Plausible classes may include tree-adjoining grammars and adaptive grammars.

XML is described as being generally parseable like a two-level grammar for non-validated XML and a Conway-style pipeline of coroutines (lexer, parser, validator) for valid XML.

The SGML standard does not define SGML with formal data structures, such as parse trees. However, an SGML document is constructed of a rooted directed acyclic graph of physical storage units known as “entities”, which is parsed into a rooted directed graph of structural units known as “elements”. NOTE: The physical graph is loosely characterized as an entity tree, but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an element tree, but the ID/IDREF markup allows arbitrary arcs.

The results of parsing can also be understood as a data tree in different notations; where the document is the root node, and entities in other notations (text, graphics) are child nodes. SGML provides much apparatus for linking to and annotating external non-SGML entities.

The SGML standard describes SGML in terms of maps and recognition modes (s9.6.1). Each entity, and each element, can have an associated notation or declared content type, which determines the kinds of references and tags that will be recognized in that entity and element. Also, each element can have an associated delimiter map (and short reference map), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a state machine switching between recognition modes. During parsing, there is a stack of maps that configure the scanner, while the tokenizer relates to the recognition modes.

Parsing involves traversing the dynamically-retrieved entity graph, finding or implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively, to recognize lexical structures, and actively, to generate missing structures and tags (which the DTD has declared omissible). End and start tags can be omitted because they can be inferred from the grammar. Loosely, a series of tags can be omitted only if there is a single possible path in the grammar to imply them. It is this active use of grammars that made concrete SGML parsing difficult to formally characterize.
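The active use of the grammar can be illustrated with a Python sketch that implies omitted end tags from a toy content model. Everything here is an assumption made for the example: the element names, the ALLOWED_CHILDREN and END_TAG_OMISSIBLE tables (standing in for a DTD), and the function imply_end_tags. Real SGML tag inference also handles omitted start tags and full content models.

```python
import re

# Hypothetical toy "DTD": which child elements each element may
# contain, and which elements declare their end tag omissible.
ALLOWED_CHILDREN = {"ul": {"li"}, "li": {"em"}, "em": set()}
END_TAG_OMISSIBLE = {"li"}

TAG = re.compile(r"<(/?)([a-z]+)>")


def imply_end_tags(text):
    """Insert end tags that the toy grammar allows to be omitted."""
    out, pos, stack = [], 0, []

    def close_top():
        # Only elements declared omissible may be closed implicitly.
        if stack[-1] not in END_TAG_OMISSIBLE:
            raise ValueError("cannot imply end tag for " + stack[-1])
        out.append("</%s>" % stack.pop())

    for match in TAG.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        is_end, name = match.group(1) == "/", match.group(2)
        if is_end:
            # Close open elements until the named one is on top.
            while stack and stack[-1] != name:
                close_top()
            if not stack:
                raise ValueError("unmatched end tag " + name)
            stack.pop()
        else:
            # Close open elements that cannot contain the new one.
            while stack and name not in ALLOWED_CHILDREN[stack[-1]]:
                close_top()
            stack.append(name)
        out.append(match.group(0))
    out.append(text[pos:])
    return "".join(out)
```

Given "<ul><li>one<li>two</ul>", the sketch generates the two missing li end tags: the first because li cannot contain another li, the second because the ul end tag can only be reached by closing the open li.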

SGML uses the term validation for both the recognition and generation uses of the DTD. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow tag omission; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML without a DTD (e.g. simple XML) is a grammar or a language; SGML with a DTD is a metalanguage. NOTE: SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism is a metalanguage.

SGML has an abstract syntax implemented by many possible concrete syntaxes; however, this is not the same usage as in an abstract syntax tree or a concrete syntax tree. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The XML Infoset corresponds more to the programming language notion of abstract syntax introduced by John McCarthy.

XML
The W3C XML (eXtensible Markup Language) is a profile (subset) of SGML designed to ease the implementation of the parser compared to a full SGML parser, primarily for use on the WWW. XML currently is more widely used than full SGML. XML has lightweight internationalization based on Unicode. Applications of XML include XHTML, SVG, RSS, Atom, XML-RPC, Semantic Web, and SOAP.

HTML
The design of HTML (Hyper Text Markup Language) was inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most HTML documents are invalid as SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application; however, the HTML markup language has many legacy- and exception-handling features that differ from SGML's requirements.

The charter for the recently revived World Wide Web Consortium HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML'". Although HTML syntax closely resembles SGML syntax with the default reference concrete syntax, HTML 5 reportedly abandons conformance as an SGML application and explicitly defines its own "html" serialization. It also defines an alternative XML-based XHTML 5 serialization, which does conform to SGML (WWW).

OED
The second edition of the Oxford English Dictionary (OED) is entirely marked up with an SGML-esque document markup language.

Others
There are other document markup languages partly related to the SGML and XML languages, but because they cannot be parsed, validated, or otherwise processed using standard SGML and XML tools, they are not considered either SGML or XML languages. The Z Format markup language for typesetting and documentation is an example.

Several modern programming languages, for example the Scala programming language, support tags as primitive token types, or now support Unicode and regular expression pattern-matching.

Applications
Document markup languages defined using SGML are called "applications" by the standard. Many pre-XML SGML applications were the proprietary property of the organizations that developed them, and thus unavailable on the World Wide Web. The following list is of pre-XML SGML applications.

TEI
The Text Encoding Initiative (TEI) is an academic consortium that designs, maintains, and develops technical standards for digital-format textual representation applications.

DocBook
DocBook is a markup language originally created as an SGML application, designed for authoring technical documentation; DocBook currently is an XML application.

CALS
CALS (Continuous Acquisition and Life-cycle Support) is a US Department of Defense (DoD) initiative for electronically capturing military documents and for linking related data and information.

EDGAR
The EDGAR (Electronic Data-Gathering, Analysis, and Retrieval) system effects automated collection, validation, indexing, acceptance, and forwarding of submissions, by companies and others, who are legally required to file data and information forms with the US Securities and Exchange Commission (SEC).

LinuxDoc
Documentation for Linux packages has used the LinuxDoc SGML DTD and the DocBook XML DTD.

Open Source Implementations
Significant open source implementations of SGML have included:


 * ASP-SGML
 * ARC-SGML, by Standard Generalized Markup Language Users', 1991, C language
 * SGMLS, by James Clark, 1993, C language
 * Project YAO, by Yuan-ze Institute of Technology, Taiwan, with Charles Goldfarb, 1994, object
 * SP by James Clark, C++ language

SP and Jade, the associated DSSSL processor, are maintained by the OpenJade project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at SUNET. The original HTML parser class in Sun Microsystems' implementation of Java is a limited-features SGML parser, using SGML terminology and concepts.