Table of atoms. This is the heart of C.MOL and represents an atom-centred description with optional bonds (C.BO). This is perhaps driven by my background as an inorganic crystallographer. Elemental identity, atomic positions and spacegroup are often necessary and sufficient to describe what the substance is. Many theoretical chemists would agree that, with the addition of the total electron count, everything else is opinion.
C.AT and C.BO can be used to give the molecular formula (connectivity) by the use of attributes such as formal ligand count, number of attached hydrogen atoms, formal charge, etc. Where possible, however, we recommend that C.FORM is used since standardisation is likely to be clearer in that format. The current conventions (SMILES and MOL) could be expanded to include others.
C.AT/C.BO may be difficult to relate to C.FORM. Where C.AT represents coordinate data, it might describe multiple copies of a molecule (as in crystallography, where an asymmetric unit can contain several identical molecules and all the coordinates must be included so that the crystal structure can be recreated). A related problem is where some of the atomic coordinates are not determined, a frequent occurrence in some techniques.
C.AT and C.BO are linked by the SERID attribute. This need not be an integer, and could be a construct such as CA15. If the tables are edited or modified it will be important to make sure that consistency is obtained and that SERIDs are always unique.
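The consistency requirement on SERIDs can be sketched as a small check. This is only an illustration: the function name `check_serids` is hypothetical, and it assumes that each bond in C.BO names its two atoms by SERID, which the text above implies but does not spell out.

```python
from collections import Counter

def check_serids(atom_serids, bond_atom_pairs):
    """Return a list of problems found; an empty list means the tables agree.

    atom_serids     -- the SERIDs from the atom table (strings such as "CA15")
    bond_atom_pairs -- (serid, serid) pairs, ASSUMED to be how bonds name atoms
    """
    problems = []
    # Every SERID in the atom table must occur exactly once.
    for serid, n in Counter(atom_serids).items():
        if n > 1:
            problems.append(f"duplicate SERID {serid!r} appears {n} times")
    # Every bond endpoint must refer to an atom that exists.
    known = set(atom_serids)
    for a, b in bond_atom_pairs:
        for serid in (a, b):
            if serid not in known:
                problems.append(
                    f"bond ({a!r}, {b!r}) refers to unknown SERID {serid!r}"
                )
    return problems

ok = check_serids(["N1", "CA15", "C2"], [("N1", "CA15"), ("CA15", "C2")])
bad = check_serids(["N1", "N1"], [("N1", "O9")])
```

Running such a check after every edit of the tables would catch both duplicated SERIDs and bonds left dangling by a deleted atom.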
The content model is simple: an optional description (X.DESC), followed by a number of (column) arrays all of length equivalent to the number of atoms. Each X.ARR corresponds to an atomic attribute. The semantics of the attribute is given by one of two mechanisms:
The actual enumeration of the attributes is given in the file 'builtin.ent',
which is definitive, rather than what is written below (although hopefully
they are in sync!). It contains:
&mol_arr_builtin;
The semantics of the hardcoded atom attributes are:
It is often conventional to split the ligands into hydrogen atoms and others because many chemical structure diagrams and many connection tables are hydrogen-suppressed. Note that bridging hydrogens (as in electron-deficient compounds) and isotopically substituted hydrogen atoms may need explicit inclusion here.
The chiral volume of a tetrahedron with 4 vertices at X1, X2, X3, X4, is given by the determinant:
$$
\frac{1}{6}
\begin{vmatrix}
1 & 1 & 1 & 1 \\
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4
\end{vmatrix}
$$
The four atoms representing the corners of the tetrahedron (PID1-PID4) must be specified. For atoms without described parity, these fields should be NULL.
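The chiral volume can be evaluated directly from the determinant. The following is a minimal sketch in plain Python; the function names `determinant` and `chiral_volume` are illustrative and not part of the DTD, and the four points are taken in the order PID1 to PID4.

```python
def determinant(m):
    """Determinant of a square matrix (list of rows) by cofactor expansion."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        # Minor: drop the first row and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * determinant(minor)
    return total

def chiral_volume(p1, p2, p3, p4):
    """Signed volume of the tetrahedron with vertices p1..p4 (x, y, z tuples)."""
    points = (p1, p2, p3, p4)
    matrix = [
        [1, 1, 1, 1],
        [p[0] for p in points],  # x1 x2 x3 x4
        [p[1] for p in points],  # y1 y2 y3 y4
        [p[2] for p in points],  # z1 z2 z3 z4
    ]
    return determinant(matrix) / 6

v = chiral_volume((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
v_swapped = chiral_volume((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 1))
```

Exchanging any two vertices changes the sign of the result, which is why the order of PID1-PID4 matters; four coplanar atoms give zero.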
C.BO contains an arbitrary number of arrays (C.ARR) for carrying bond information:
Chirality. This is molecular chirality information, not atomic parity. It's unlikely to have much parsable semantics. Major concerns are: enantiomeric purity; absolute and relative stereochemistry, etc.
The atomic chirality will be found in C.AT.
Additional coordinates for the atoms. The intention is that each C.COOR is commensurate with the C.AT records and contains three arrays, one each for the X, Y and Z coordinates. Each C.COOR will represent a different observation of the data in C.AT, and it is assumed that all other quantities (formal charge, connectivity, etc.) remain constant. (There may be some concepts, such as Free Energy Perturbation, where one atom mutates into another, which are not easily represented by this technique.) In general C.COOR should be able to represent NMR experiments, different conformations, or disordered molecules in the crystal, and probably much else.
Provision is made for an X.VAR to record the time for dynamics.
Crystallographic data. This is mainly for the unitcell, spacegroup, and crystallographic experimental data (e.g. wavelengths, etc.)
The content model is an optional description, optional matrices (which could either be used for orthogonalisation or for space group symmetry) and any number of data blocks (X.LIST).
This is yet to be worked out. It can represent the SW-PROT FEATURES, but many of those are comments on the protein, rather than descriptions of the protein. I haven't tried to see how PDB fits yet. It may have to remain a hybrid.
The SWISS-PROT description is given in the content.
Chemical formula. The primary purpose of this is to say what the molecule is, not to represent ideas about it. No present method covers all molecules, and for many we have only partial information (e.g. stoichiometry). C.FORM allows for one connexion table in the content, but more than one C.FORM is allowed within C.MOL to cover multiple components (especially in crystallographic files).
C.FORM contains an optional hypertext description, an optional mapping onto 3-D coordinates (C.MMAP) and optional X.LIST, X.ARR and X.VAR. The primary use for the latter three is connexion tables. The connexion tables can be textual (e.g. SMILES) or the components of an atom-bond based table, following the same convention as in C.AT and C.BO. In C.FORM both atom and bond arrays can be used, which will normally be of different sizes. X.VAR can also be used for reference numbers, etc. (MEDLINE, SWISSPROT, Cambridge, etc.).
Application dependent.
Mapping of connectivity or sequence onto C.AT or some other part of C.MOL. The molecular formula may not correspond to the 3-D coordinates for many reasons:
I welcome simple suggestions here as I haven't thought of any particularly good scheme.
The content model of a C.MOL (molecule) allows for considerable flexibility in storage.
Among the molecular properties and data it can handle are (in order):
Although many of these could also be held in an XML file without MOL.DTD, the containment within a molecule is very well suited to molecular databases (e.g. crystallography) where all data is "attached" to a molecule.
NOTE: The use of the term 'molecule' is not meant to imply anything about the bonding model or physical nature of the thing in question. C.MOL can be used to hold data on extended solids (such as NaCl) or van der Waals complexes. The bonding model is kept simple to emphasise that for many molecules there need to be additional semantics to specify it adequately. The simple model may be refined over time.
The primary use of C.MOL is to provide at least one way of accurately conveying the precise nature and identity of the substance. This may not always be the best or most efficient.
The present limitations of C.MOL are:
Biomolecular Sequence. This is intended to cover only those molecules where the chemical identity is an important aspect, and is not intended to intrude into genome structure, etc. It also covers only 'simple' types of sequence (PROTein, DNA, RNA, CARBohydrate). CML will not (at present) provide a comprehensive list of monomers and there is only very limited support for covalently modified molecules.
In general, therefore, this should only be used for 'normal' proteins, small stretches of DNA or RNA without 'unusual' components, and carbohydrates which can be represented by a simple linear text string. It is unsuitable for cyclic molecules, modified bases, unusual amino acids, branched saccharides, etc. The chain termination is also unlikely to be well defined (e.g. monophosphate?, acetylated N-terminus?). Covalent modifications may be described textually (e.g. 'glycosylated').
Molecular (not crystallographic) symmetry. The author can specify a point group or a set of symmetry operations (this could be useful for a helical molecule or one in a non-standard orientation.)
The content is the symmetry operators as (4*3) matrices (X.ARR). These should have the form [R|t] where R premultiplies the coordinates and t is a column translation vector. It is up to the author whether they give a complete set of operators (e.g. 48 for Oh) or whether they give just the group generators. The identity matrix can be assumed to be present in all cases.
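Applying an operator of the form [R|t] to a coordinate can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the twelve numbers in the X.ARR are written row by row (three rows of four values, the last value in each row being the translation component); the text above does not fix the serialisation order.

```python
def parse_operator(text):
    """Split 12 whitespace-separated numbers into 3 rows of 4 (ASSUMED row-major)."""
    values = [float(v) for v in text.split()]
    if len(values) != 12:
        raise ValueError("a [R|t] operator needs exactly 12 numbers")
    return [values[i:i + 4] for i in range(0, 12, 4)]

def apply_operator(op, xyz):
    """Return R.x + t, where R is the first three columns of op and t the fourth."""
    return tuple(
        sum(row[j] * xyz[j] for j in range(3)) + row[3]
        for row in op
    )

# A twofold rotation about z combined with a shift of 0.5 along z,
# the kind of operator one might use for a helical molecule.
screw = parse_operator("-1 0 0 0   0 -1 0 0   0 0 1 0.5")
image = apply_operator(screw, (1.0, 2.0, 3.0))
```

Here `image` is `(-1.0, -2.0, 3.5)`. If only the generators are given, an application would have to apply them repeatedly to build the complete set of operators.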
From the HTML 2.0 DTD:
Document Type Definition for the HyperText Markup Language (HTML DTD)
$Id: html.dtd,v 1.29 1995/08/04 17:50:22 connolly Exp $
Author: Daniel W. Connolly <connolly@w3.org>
See Also: html.decl, html-1.dtd
http://www.w3.org/hypertext/WWW/MarkUp/MarkUp.html
<!ENTITY % HTML.Version "-//IETF//DTD HTML 2.0//EN"
  -- Typical usage:
     <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
     <html>
     ...
     </html>
  --
  >
The address of a person or organisation. It can contain electronic components such as E-Mail or URLs.
A homogeneous (1- or 2-dimensional) array of variables. The values are
given as a white-space-separated string, with quotes around elements
containing whitespace. The dimension of the matrix is determined as follows:
If no attributes are given: 1-dimensional.
If SIZE (but not ROWS/COLUMNS) is given: 1-dimensional.
If ROWS and COLUMNS are given: 2-dimensional. SIZE is ignored.
If TYPE represents a square or triangular matrix (SQUARE, ANTISYMMETRIC,
LOWERTRIANGLE, ORTHOGONAL, SYMMETRIC, UNITARY, UPPERTRIANGLE) only ONE of
ROWS or COLUMNS need be given. SIZE is ignored.
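The four rules above can be sketched as a single function. This is an illustration only: `array_shape` is a hypothetical name, the attributes are assumed to arrive as a dictionary of strings, and the case where just one of ROWS or COLUMNS is given for a non-square TYPE is not covered by the rules, so the sketch rejects it.

```python
SQUARE_TYPES = {
    "SQUARE", "ANTISYMMETRIC", "LOWERTRIANGLE", "ORTHOGONAL",
    "SYMMETRIC", "UNITARY", "UPPERTRIANGLE",
}

def array_shape(attrs, n_values):
    """Return the logical shape of an X.ARR as a tuple.

    attrs    -- attribute dictionary, e.g. {"ROWS": "3", "COLUMNS": "4"}
    n_values -- number of elements actually present in the content
    """
    rows = attrs.get("ROWS")
    cols = attrs.get("COLUMNS")
    # Square or triangular TYPE: one of ROWS/COLUMNS is enough, SIZE is ignored.
    if attrs.get("TYPE") in SQUARE_TYPES and (rows or cols):
        n = int(rows or cols)
        return (n, n)
    # Both ROWS and COLUMNS: 2-dimensional, SIZE is ignored.
    if rows and cols:
        return (int(rows), int(cols))
    if rows or cols:
        # ASSUMPTION: the rules do not define this case, so treat it as an error.
        raise ValueError("only one of ROWS/COLUMNS given for a non-square TYPE")
    # No attributes, or SIZE alone: 1-dimensional.
    size = attrs.get("SIZE")
    return (int(size),) if size else (n_values,)
```

For a triangular TYPE the returned shape is the logical one; how many elements are actually stored in the content is a separate question that this sketch does not address.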
An array of anything more complicated (links (A), X.ARR, etc) requires the use of X.LIST rather than X.ARR.
Some arrays will be sparse or have missing values. The word NULL can be used
to denote an element for which there is no information. Where many identical
values are required, a premultiplier can be used, as in:
1.2 3.4 25*NULL 23*4.5 3*NULL 4.1
which would represent an array of length 54.
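A reader of X.ARR content has to expand the premultipliers and recognise NULL. The following is a minimal sketch, with `expand_array` as a hypothetical name. It uses `shlex.split` to honour the quoting rule, leaves values as strings (type conversion is a separate step), and does not try to protect a quoted element that happens to look like `3*foo`.

```python
import re
import shlex

# "25*NULL" -> count 25, item "NULL"
_REPEAT = re.compile(r"^(\d+)\*(.+)$")

def expand_array(content):
    """Expand X.ARR content into a flat list; NULL becomes None."""
    values = []
    for token in shlex.split(content):
        match = _REPEAT.match(token)
        if match:
            count, item = int(match.group(1)), match.group(2)
        else:
            count, item = 1, token
        values.extend([None if item == "NULL" else item] * count)
    return values

expanded = expand_array("1.2 3.4 25*NULL 23*4.5 3*NULL 4.1")
```

`len(expanded)` is 54 for the example above, matching the stated length.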
Sometimes (as for a controlling variable or an axis on a graph) an array
can be generated from a linear expression
and the values in the content can be omitted. In this case
TYPE must be INTEGER or FLOAT, START and DELTA must both be given and be
of this type and SIZE must also be given. An example:
<X.ARR START="3.1" DELTA="0.3" SIZE="5"></>
is equivalent to:
<X.ARR>3.1 3.4 3.7 4.0 4.3</> .
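Generating the values from START, DELTA and SIZE is a one-line linear expression. The sketch below (the name `linear_array` is hypothetical) uses `decimal.Decimal` rather than binary floats so that the generated values print exactly as they would be written in the content.

```python
from decimal import Decimal

def linear_array(start, delta, size):
    """Return the SIZE values START, START+DELTA, ... as strings."""
    first, step = Decimal(start), Decimal(delta)
    return [str(first + i * step) for i in range(size)]

generated = linear_array("3.1", "0.3", 5)
```

For the example above this gives `['3.1', '3.4', '3.7', '4.0', '4.3']`. With ordinary floating-point arithmetic the second value would come out as 3.4000000000000004, which is why the attribute strings are converted with `Decimal` here.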
X.BIB (simple tool for 'most' bibliographic requirements) Compiled from other bibliographic standards. Deliberately kept simple so as to be readable (I couldn't understand the other ones :-). Because there is no structure, the renderer and authoring tools have to have some semantics.
The content is an optional description (HTML), then (in any order) an optional list of authors (X.AUT), X.VAR (primarily for a date) and an optional list of addresses (X.ADD). The addresses should correspond to the citation/organisation since X.AUT has its own provision for addresses.
A figure. At present the figure has no internal semantic content, but can carry textual description and other attributes.
How to transport the figure is not yet solved. I have provided for two possibilities:
The content is therefore an optional description (caption) (HTML) and an optional (encoded) file.
Free text for any purpose. The primary purpose is to encapsulate foreign material and describe it with attributes (not for authors to write semantically void text!). Various NOTATIONs are described (if you know what to do with them). The contents should NOT include the characters < / [A-Z] in that order, or the SGML parser will think it has hit an end-tag open delimiter (ETAGO). It's up to the applications whether they uuencode binary files. If so, it may be worth replacing < with, say, &lt;
HTML allows authors to add hypertext of the complexity of the current HTML language (as of Sept. 1995 this is HTML 2.0). Authors are assumed to be familiar with HTML and the DTD will not be documented here. There are, however, a few important differences:
In other words, it is an allowable %body.content without FORMS.
A generic container. It can be used to construct most of the common container classes (although these can only be validated at postprocessing time). The DTD imposes very few constraints on how X.LIST can be used, but CONTENT can be set to show certain common methods. X.LIST can contain any or all of the common generic data items (A, X.ARR, X.VAR and X.LIST itself). The commonest uses are:
Example:
<X.LIST STRUCT=PERSON><X.VAR>John Doe<A HREF=jdoe@xyzzy.com></A></X.LIST>
The format of the table is different from the HTML 2.1 tables (TAB), and even when those come in, X.LIST will be retained. It has much more scope for semantics.
Note that the counts (COLUMNS, ROWS, SIZE) are advisory and primarily used for checking. The postprocessor is assumed to be able to count.
A person. The block can contain an address and various other attributes
Symbolic variable. Some numeric or string quantities may have to be represented by a symbolic variable rather than an explicit value. X.SYM is an experimental approach towards this and the details have not yet been worked out.
Generic variable. The type (TYPE) and UNITS may be specified. May be extended to simple geometrical objects (e.g. point, circle, etc.).
The toplevel container for XML files. It consists of a HEAD and any of the XML elements in any order. Rarely needed unless MOL is excluded from the DTD.