cml DTD Quick Reference


<a>

Anchor; source/destination of link (HTML 2.0)

<address>

Address, signature, or byline (HTML 2.0)

<b>

Bold text (HTML 2.0)

<base>

The base address of the document: relative URLs within the document are resolved against the BASE URL. (HTML 2.0)

<blockquote>

Quoted passage (HTML 2.0)

<body>

BODY is not used in CML although most of its contents are. Note that, unlike most browser implementations, HTML 2.0 strictly requires that BODY contents start with a tag (e.g. P, HR, Hn, etc.) and NOT with free text. (HTML 2.0)

<br>

Line break (HTML 2.0)

<c.at>

Table of atoms. This is the heart of C.MOL and represents an atom-centred description with optional bonds (C.BO). This is perhaps driven by my background as an inorganic crystallographer. Elemental identity, atomic positions and spacegroup are often necessary and sufficient to describe what the substance is. Many theoretical chemists would agree that, with the addition of the total electron count, everything else is opinion.

C.AT and C.BO can be used to give the molecular formula (connectivity) by the use of attributes such as formal ligand count, number of attached hydrogen atoms, formal charge, etc. Where possible, however, we recommend that C.FORM is used since standardisation is likely to be clearer in that format. The current conventions (SMILES and MOL) could be expanded to include others.

C.AT/C.BO may be difficult to relate to C.FORM. Where C.AT represents coordinate data, this might relate to multiple copies of a molecule (as in crystallography where an asymmetric unit can contain several identical molecules and all the coordinates must be included so that the crystal structure can be recreated.) A related problem is where some of the atomic coordinates are not determined, a frequent occurrence in some techniques.

C.AT and C.BO are linked by the SERID attribute. This need not be an integer, and could be a construct such as CA15. If the tables are edited or modified it will be important to make sure that consistency is obtained and that SERIDs are always unique.
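
The consistency requirement above could be checked mechanically after a table has been edited. The following is a minimal Python sketch, not part of the DTD. It assumes the atom SERIDs have already been extracted into a list and that each bond is available as a pair of atom SERIDs; the names check_serids, atom_serids and bonds are hypothetical, since the exact layout of the C.BO arrays is not given in this summary.

```python
from collections import Counter


def check_serids(atom_serids, bonds):
    """Return a list of problems; an empty list means the tables are consistent.

    atom_serids: SERID strings from C.AT (hypothetical pre-extracted list).
    bonds: pairs of atom SERIDs from C.BO (assumed layout, for illustration).
    """
    problems = []
    # Every SERID must occur exactly once in the atom table.
    for serid, n in Counter(atom_serids).items():
        if n > 1:
            problems.append(f"duplicate SERID {serid!r} ({n} times)")
    # Every bond end must point at an atom that exists.
    known = set(atom_serids)
    for a, b in bonds:
        for serid in (a, b):
            if serid not in known:
                problems.append(f"bond refers to unknown SERID {serid!r}")
    return problems
```

SERIDs are compared as strings, so a construct such as CA15 works as well as an integer label.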

The content model is simple: an optional description (X.DESC), followed by a number of (column) arrays all of length equivalent to the number of atoms. Each X.ARR corresponds to an atomic attribute. The semantics of the attribute is given by one of two mechanisms:

The actual enumeration of the attributes is given in a file 'builtin.ent' and this is definitive, rather than what is written below (although hopefully they are in sync!). It contains:
&mol_arr_builtin;
The semantics of the hardcoded atom attributes are:

<c.bo>

C.BO contains an arbitrary number of arrays (C.ARR) for carrying bond information:

<c.chain>

<c.chir>

Chirality. This is molecular chirality information, not atomic parity. It's unlikely to have much parsable semantics. Major concerns are: enantiomeric purity; absolute and relative stereochemistry, etc.

The atomic chirality will be found in C.AT.

<c.coor>

Additional coordinates for the atoms. The intention is that each C.COOR is commensurate with the C.AT records and contains three arrays, one each for the X, Y and Z coordinates. Each C.COOR will represent a different observation of the data in C.AT and it is assumed that all other quantities (formal charge, connectivity, etc.) remain constant. (There may be some concepts such as Free Energy Perturbation, where one atom mutates into another, which may not be easily represented by this technique.) In general C.COOR should be able to represent NMR experiments, different conformations, or disordered molecules in the crystal, and probably much else.

Provision is made for an X.VAR to record the time for dynamics.
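
A postprocessor might want to confirm that a C.COOR really is commensurate with C.AT before using it. The following Python sketch shows one way to do that, assuming the three coordinate arrays have already been parsed into lists; the function name check_coor and its arguments are illustrative and are not defined by the DTD.

```python
def check_coor(n_atoms, x, y, z):
    """Return True if the X, Y and Z arrays each hold one value per atom.

    Raises ValueError naming the first array whose length does not match
    the number of atoms in C.AT.
    """
    for name, arr in (("X", x), ("Y", y), ("Z", z)):
        if len(arr) != n_atoms:
            raise ValueError(
                f"{name} array has {len(arr)} values, expected {n_atoms}"
            )
    return True
```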

<c.crys>

Crystallographic data. This is mainly for the unitcell, spacegroup, and crystallographic experimental data (e.g. wavelengths, etc.)

The content model is an optional description, optional matrices (which could either be used for orthogonalisation or for space group symmetry) and any number of data blocks (X.LIST).

<c.feat>

This is yet to be worked out. It can represent the SWISS-PROT FEATURES, but many of those are comments on the protein, rather than descriptions of the protein. I haven't tried to see how PDB fits yet. It may have to remain a hybrid.

The SWISS-PROT description is given in the content.

<c.form>

Chemical formula. The primary purpose of this is to say what the molecule is, not to represent ideas about it. No present method covers all molecules, and for many we have only partial info (e.g. stoichiometry). C.FORM allows for one connexion table in the content, but more than one C.FORM is allowed within C.MOL to cover multiple components (especially in crystallographic files).

C.FORM contains an optional hypertext description, an optional mapping onto 3-D coordinates (C.MMAP) and optional X.LIST, X.ARR and X.VAR. The primary use of the latter three is for connexion tables. The connexion tables can be textual (e.g. SMILES) or the components of an atom-bond based table, following the same convention as in C.AT and C.BO. In C.FORM both atom and bond arrays can be used, which will normally be of different sizes. X.VAR can also be used for reference numbers, etc. (MEDLINE, SWISSPROT, Cambridge, etc.).

<c.mmap>

Mapping of one molecular representation to another (e.g. 2D-->3D numbering).

Application dependent.

Mapping of connectivity or sequence onto C.AT or some other part of C.MOL. The molecular formula may not correspond to the 3-D coordinates for many reasons:

I welcome simple suggestions here as I haven't thought of any particularly good scheme.

<c.mol>

The content model of a C.MOL (molecule) allows for considerable flexibility in storage.

Among the molecular properties and data it can handle are (in order):

Although many of these could also be held in an XML file without MOL.DTD, the containment within a molecule is very well suited to molecular databases (e.g. crystallography) where all data is "attached" to a molecule.

NOTE: The use of the term 'molecule' is not meant to imply anything about the bonding model or physical nature of the thing in question. C.MOL can be used to hold data on extended solids (such as NaCl) or van der Waals complexes. The bonding model is kept simple to emphasise that for many molecules there need to be additional semantics to specify it adequately. The simple model may be refined over time.

The primary use of C.MOL is to provide at least one way of accurately conveying the precise nature and identity of the substance. This may not always be the best or most efficient.

The present limitations of C.MOL are:

<c.seq>

Biomolecular Sequence. This is intended to cover only those molecules where the chemical identity is an important aspect, and is not intended to intrude into genome structure, etc. It also covers only 'simple' types of sequence (PROTein, DNA, RNA, CARBohydrate). CML will not (at present) provide a comprehensive list of monomers and there is very limited support for covalently modified molecules.

In general, therefore, this should only be used for 'normal' proteins, small stretches of DNA or RNA without 'unusual' components, and carbohydrates which can be represented by a simple linear text string. It is unsuitable for cyclic molecules, modified bases, unusual amino acids, branched saccharides, etc. The chain termination is also unlikely to be well defined (e.g. monophosphate?, acetylated N-terminus?). Covalent modifications may be described textually (e.g. 'glycosylated').

<c.symm>

Molecular (not crystallographic) symmetry. The author can specify a point group or a set of symmetry operations (this could be useful for a helical molecule or one in a non-standard orientation.)

The content is the symmetry operators as (4*3) matrices (X.ARR). These should have the form [R|t] where R premultiplies the coordinates and t is a column translation vector. It is up to the author whether they give a complete set of operators (e.g. 48 for Oh) or whether they give just the group generators. The identity matrix can be assumed to be present in all cases.
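
To make the [R|t] convention concrete, here is a short Python sketch that applies an operator to a coordinate as x' = R x + t. It assumes each operator has been read into three rows of four numbers (R in the first three columns, t in the fourth); that row layout, and the names apply_operator, expand_molecule and C2_Z, are assumptions made for illustration rather than something the DTD prescribes.

```python
def apply_operator(op, xyz):
    """Apply one [R|t] operator to a point: x' = R x + t.

    op is three rows of four numbers; the first three entries of each
    row belong to R and the fourth to t (assumed storage layout).
    """
    return tuple(
        sum(row[j] * xyz[j] for j in range(3)) + row[3]
        for row in op
    )


def expand_molecule(operators, atoms):
    """Return one copy of the atoms per operator, identity first.

    The identity is prepended because it is assumed present even when
    the author does not list it.
    """
    identity = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]
    return [
        [apply_operator(op, xyz) for xyz in atoms]
        for op in [identity] + list(operators)
    ]


# A twofold rotation about z with no translation.
C2_Z = [
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
```

With this operator, the point (1.0, 2.0, 3.0) maps to (-1.0, -2.0, 3.0). If only group generators are supplied, an application would still have to multiply them out to obtain the full group; that step is not shown here.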

<cite>

Name or title of cited work (HTML 2.0)

<cml>

CML has a simple content allowing a very flexible approach to the construction of CML files. As CML is a superset of the three DTDs it can be used for any applications restricted to two or fewer (note that XML often uses HTML and MOL often uses XML).

<code>

Source code phrase (HTML 2.0)

<dd>

Definition of term (HTML 2.0)

<dir>

Not used in CML (HTML 2.0)

<dl>

Definition list, or glossary (HTML 2.0)

<dt>

Term in definition list (HTML 2.0)

<em>

Emphasized phrase (HTML 2.0)

<h1>

Heading, level 1 (HTML 2.0)

<h2>

Heading, level 2 (HTML 2.0)

<h3>

Heading, level 3 (HTML 2.0)

<h4>

Heading, level 4 (HTML 2.0)

<h5>

Heading, level 5 (HTML 2.0)

<h6>

Heading, level 6 (HTML 2.0)

<head>

Container for meta-information. All CML documents must have a HEAD, which must include a TITLE. All other components are optional, though users are well advised to consider including them. (HTML 2.0)

<hr>

Horizontal rule (HTML 2.0)

<html>

Document type for HTML (top level container). In HTML documents there is a HEAD and BODY. These would confuse XML authors and so the body content of HTML is used within the X.HTML container (which also has additional attributes). (HTML 2.0)

From the HTML 2.0 DTD:


Document Type Definition for the HyperText Markup Language
(HTML DTD)

$Id: html.dtd,v 1.29 1995/08/04 17:50:22 connolly Exp $

Author: Daniel W. Connolly <connolly@w3.org>
See Also: html.decl, html-1.dtd
http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html

<!ENTITY % HTML.Version
"-//IETF//DTD HTML 2.0//EN"

-- Typical usage:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
...
</html>
--
>

<i>

Italic text (HTML 2.0)

<img>

Image; icon, glyph or illustration, which may also be a clickable map (ISMAP). (HTML 2.0)

<isindex>

Not used in CML (HTML 2.0)

<kbd>

Keyboard phrase, e.g. user input (HTML 2.0)

<li>

List item (HTML 2.0)

<link>

This is discussed in Murray Altheim's paper on the semantics of addressing. The descriptions of the attributes are rather short... (HTML 2.0)

<menu>

Not used in CML (HTML 2.0)

<meta>

This is for describing the contents, purpose, etc of the document. The WWW community has yet to produce clear standards for this and the most promising (1995) is the Dublin Core proposal of eleven categories of meta-information. Until this is developed, CML does not give guidance here although the use of META information is strongly recommended. (HTML 2.0)

<nextid>

Not used in CML (HTML 2.0)

<ol>

Ordered, or numbered list (HTML 2.0)

<p>

Paragraph. Note that in strict HTML this is a container (<P> ... </P>), so every paragraph should start with <P>; it is not a separator. (HTML 2.0)

<pre>

Preformatted text. (HTML 2.0)

<samp>

Sample text or characters (HTML 2.0)

<strong>

Strong emphasis (HTML 2.0)

<title>

All CML documents must have a TITLE. This will normally be rendered as a textual description of the contents or purpose of the document. (HTML 2.0)

<tt>

Typewriter text (HTML 2.0)

<ul>

Unordered list (HTML 2.0)

<var>

Variable phrase or substitutable (HTML 2.0)

<x.add>

The address of a person or organisation. It can contain electronic components such as E-Mail or URLs.

<x.arr>

A homogeneous (1- or 2-dimensional) array of variables. The values are given as a white-space-separated string, with quotes around elements containing whitespace. The dimension of the matrix is determined as follows:
If no attributes are given: 1-dimensional.
If SIZE (but not ROWS/COLUMNS) is given: 1-dimensional.
If ROWS and COLUMNS are given: 2-dimensional. SIZE is ignored.
If TYPE represents a square or triangular matrix (SQUARE, ANTISYMMETRIC, LOWERTRIANGLE, ORTHOGONAL, SYMMETRIC, UNITARY, UPPERTRIANGLE) only ONE of ROWS or COLUMNS need be given. SIZE is ignored.
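
The four rules above can be written as a small decision function. This is a Python sketch only: the function name xarr_shape, the attrs dictionary of attribute strings and the n_values count are assumptions for illustration. The case where a single ROWS or COLUMNS is given for a non-square TYPE is not covered by the rules, and the sketch arbitrarily treats it as 1-dimensional.

```python
# TYPE values that imply a square or triangular matrix, as listed above.
SQUARE_TYPES = {
    "SQUARE", "ANTISYMMETRIC", "LOWERTRIANGLE", "ORTHOGONAL",
    "SYMMETRIC", "UNITARY", "UPPERTRIANGLE",
}


def xarr_shape(attrs, n_values):
    """Return (n,) for a 1-D array or (rows, columns) for a 2-D array.

    attrs: dict of X.ARR attribute strings (hypothetical representation).
    n_values: number of elements found in the content.
    """
    rows = attrs.get("ROWS")
    cols = attrs.get("COLUMNS")
    # ROWS and COLUMNS together: 2-dimensional, SIZE is ignored.
    if rows is not None and cols is not None:
        return (int(rows), int(cols))
    # Square or triangular TYPE: one of ROWS/COLUMNS is enough.
    is_square = attrs.get("TYPE", "").upper() in SQUARE_TYPES
    if is_square and (rows is not None or cols is not None):
        n = int(rows if rows is not None else cols)
        return (n, n)
    # SIZE alone, or no attributes at all: 1-dimensional.
    if "SIZE" in attrs:
        return (int(attrs["SIZE"]),)
    return (n_values,)
```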

An array of anything more complicated (links (A), X.ARR, etc) requires the use of X.LIST rather than X.ARR.

Some arrays will be sparse or have missing values. The word NULL can be used to denote an element for which there is no information. Where many identical values are required, a premultiplier can be used, as in:
1.2 3.4 25*NULL 23*4.5 3*NULL 4.1
which would represent an array of length 54.
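
The premultiplier notation is easy to expand in a postprocessor. The Python sketch below is one possible reading of it, not a reference implementation: it uses the standard shlex module to honour quoted elements, maps NULL to None, and leaves all other values as strings. The function name expand_xarr is hypothetical, and a quoted element that itself looks like "3*x" would be expanded too, which a real parser would need to guard against.

```python
import re
import shlex

# A token such as 25*NULL: a repeat count, an asterisk, then the value.
_REPEAT = re.compile(r"^(\d+)\*(.+)$")


def expand_xarr(content):
    """Expand X.ARR content into a flat list, with None for NULL."""
    values = []
    for token in shlex.split(content):  # keeps quoted elements together
        match = _REPEAT.match(token)
        if match:
            count, item = int(match.group(1)), match.group(2)
        else:
            count, item = 1, token
        values.extend([None if item == "NULL" else item] * count)
    return values


values = expand_xarr("1.2 3.4 25*NULL 23*4.5 3*NULL 4.1")
print(len(values))  # 54
```

The count agrees with the example above: 2 + 25 + 23 + 3 + 1 = 54.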

Sometimes (as for a controlling variable or an axis on a graph) an array can be generated from a linear expression and the values in the content can be omitted. In this case TYPE must be INTEGER or FLOAT, START and DELTA must both be given and be of this type and SIZE must also be given. An example:
<X.ARR START="3.1" DELTA="0.3" SIZE="5"></>
is equivalent to:
<X.ARR>3.1 3.4 3.7 4.0 4.3</> .
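
A generated array of this kind can be reproduced as START + i*DELTA for i from 0 to SIZE-1. The Python sketch below uses the standard decimal module so that the generated values print exactly as in the example, without binary floating-point noise; the function name generate_xarr is an assumption, and the attribute values are taken as strings as they would appear in the tag.

```python
from decimal import Decimal


def generate_xarr(start, delta, size):
    """Return the SIZE values START, START+DELTA, ... as strings."""
    start_d = Decimal(start)
    delta_d = Decimal(delta)
    return [str(start_d + i * delta_d) for i in range(int(size))]


print(" ".join(generate_xarr("3.1", "0.3", "5")))  # 3.1 3.4 3.7 4.0 4.3
```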

<x.bib>

X.BIB is a simple tool for 'most' bibliographic requirements, compiled from other bibliographic standards. Deliberately kept simple so as to be readable (I couldn't understand the other ones :-). Because there is no structure, the renderer and authoring tools have to supply some of the semantics.

The content is an optional description (HTML), then (in any order) an optional list of authors (X.AUT), X.VAR (primarily for a date) and an optional list of addresses (X.ADD). The addresses should correspond to the citation/organisation since X.AUT has its own provision for addresses.

<x.fig>

A figure. At present the figure has no internal semantic content, but can carry textual description and other attributes.

How to transport the figure is not yet solved. I have provided for two possibilities:

The content is therefore an optional description (caption) (HTML) and an optional (encoded) file.

<x.fre>

Free text for any purpose. The primary purpose is to encapsulate foreign material and describe it with attributes (not for authors to write semantically void text!). Various NOTATIONs are described (if you know what to do with them). The contents should NOT include any characters of the sort < / [A-Z] in that order, or the SGML parser will think it's hit an end-tag (ETAGO). It's up to the applications whether they uuencode binary files. If so, it may be worth replacing < with, say, &lt;.

<x.html>

HTML allows authors to add hypertext of the complexity of the current HTML language (at Sept. 1995 this is HTML 2.0). Authors are assumed to be familiar with HTML and the DTD will not be documented here. There are, however, a few important differences:

<x.list>

A generic container. It can be used to construct most of the common container classes (although these can only be validated at postprocessing time). The DTD imposes very few constraints on how X.LIST can be used, but CONTENT can be set to show certain common methods. X.LIST can contain any or all of the common generic data items (A, X.ARR, X.VAR and X.LIST itself). The commonest uses are:

Note that the counts (COLUMNS, ROWS, SIZE) are advisory and primarily used for checking. The postprocessor is assumed to be able to count.

<x.pers>

A person. The block can contain an address and various other attributes.

<x.sym>

Symbolic variable. Some numeric or string quantities may have to be represented by a symbolic variable rather than an explicit value. X.SYM is an experimental approach towards this and the details have not yet been worked out.

<x.var>

Generic variable. The type (TYPE) and UNITS may be specified. May be extended to simple geometrical objects (e.g. point, circle, etc.).

<xml>

The toplevel container for XML files. It consists of a HEAD and any of the XML elements in any order. Rarely needed unless MOL is excluded from the DTD.

