
Overview
The SketchEl molecule format was developed for the SketchEl open source project, which is hosted on SourceForge. It is used by all MMI software for storing 2D molecular structures. Products such as MMDS represent molecules using an internal data structure that maps directly onto the SketchEl molecule format, so the two can be interconverted with no information loss.

The format is largely equivalent to the basic properties of an MDL MOL file. A large proportion of ordinary molecules can be converted between SketchEl and MDL MOL formats without information loss, but there are some differences which are crucial for certain types of molecules. In particular, the SketchEl molecule format provides additional information for describing implicit hydrogens and has more available bond orders, which are vital for describing molecular structures that do not follow the Lewis octet rule. The MDL MOL format has a very long list of properties which are not usually used for describing individual small molecules, e.g. query features, stereochemical mixtures, polymer units, etc., none of which are explicitly supported by the SketchEl format. The only features included in the SketchEl format are those which have cheminformatic meaning.

The SketchEl format is not intended to describe a drawing. Rather, it describes a molecular species with enough detail to ascertain its chemical composition and enough information to produce a drawing from it. This distinction is important, because it means that the format has no properties for encoding drawing style (e.g. font size, line width, colour, double bond positioning, etc.), and it has no fields for additional decorations (e.g. text labels, reaction arrows, molecular orbital lobes). All of these additional properties and decorations can be accomplished in other ways, but they are not a part of the core molecule format.
Because the format is minimalistic, and all of the features have a well defined meaning, it is straightforward to interpret a SketchEl molecule in its raw form. Additional properties of the molecular structure need to be calculated after the fact (e.g. aromaticity, stereochemical parity), or stored in an ancillary field within a datasheet, or encoded as arbitrary data within the atom or bond extension fields and interpreted as and when necessary.

Specification
Outline
A SketchEl character stream is described by the following pattern:
A very simple example - ethanol with implicit hydrogens - is as follows:
A second example, based on the same heavy-atom ethanol structure, shows how not to make a cheminformatically meaningful structure, but nonetheless demonstrates some features of the format:
All 3 atoms, and both bonds, have an invariant expansion string and a dependent expansion string. The latter will all be removed if the slightest modification is made to the structure. The third atom, oxygen, is represented by an escape code (hex 4f), which is valid even when the raw character is allowed. The first atom is assigned a charge of +1, and its hydrogen count is assigned automatically, and was calculated to be 2. The second atom has its number of hydrogen atoms fixed at 2, which coincidentally happens to be the same number as would be automatically calculated for a methylene. The third atom is a radical oxygen, with 1 unpaired electron, and an automatic hydrogen count which happens to be 0. Both of the bonds are ordered so that atom 2 is the source point, and the bond types are inclined and declined, respectively, which has no stereochemical meaning since the central carbon atom has symmetry.

Abbreviations
Inline atom abbreviations are an extension to the core format, which allow atoms to be labelled with abbreviation codes, e.g. Et, tBu, Ph, etc., and retain embedded versions of the abbreviated structure fragment. Because each abbreviation retains its structural content within the description of the molecule, it is still possible to derive the molecular formula correctly without having to resort to an external lookup table. Also, abbreviations can be defined for custom purposes and do not need to be standardised. It is straightforward to expand out abbreviations to obtain a fully atom-specified structure. The structures of inline abbreviations may contain their own inline abbreviations, which are encoded recursively.

Software that is capable of reading and writing the core SketchEl format will gracefully ignore abbreviations: the abbreviation labels will be displayed, but their meaning will be unknown.
Modifying atoms that do not contain abbreviations is defined to be safe through multiple read/write cycles, which provides a measure of forward compatibility. Consider the following example of butylbenzene:
Structure (a) shows the all heavy atom structure, while structure (b) shows the representation which uses Bu to represent the butyl substituent, and (c) shows the definition of the abbreviation. The SketchEl format representation of (b) is encoded as follows:
Atom 7 has the element label Bu. It is an otherwise unremarkable atom placeholder, except for the extension field: aSketchEl!(5\002C4)\000A*.... The prefix a is not defined in the core format, so it is treated as an invariant expansion field. The content following the letter a is processed using the escape codes, which is why the long string shown above contains so many backslashes.

The definition for the Bu "atom" contains the full structure (c) in the diagram above. This abbreviation definition contains 5 atoms. The first atom has the label "*", and the remaining 4 atoms are carbons. The first atom must be defined with an asterisk as the element type. It is a special placeholder atom.

When the primary structure and the inline abbreviation are considered together, they are defined as the union of the two structures, where the primary placeholder atom (in this case Bu) and the abbreviation placeholder atom (always the first atom, and with the label *) are deleted. The bond to the abbreviation in the primary structure (i.e. the connection between the phenyl ring and Bu) is deleted, and the bond (or bonds) from the inline abbreviation are used to replace it:
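The union-and-delete rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SketchEl or MMDS implementation: the function name `expand_abbreviation`, the representation of atoms as a list of labels, and the representation of bonds as `(from, to, order)` tuples are all hypothetical, and coordinates, charges and extension fields are ignored. It also assumes the abbreviation is attached to the main structure by a single bond.

```python
def expand_abbreviation(atoms, bonds, abbrev_idx, frag_atoms, frag_bonds):
    """Expand one inline abbreviation into the primary structure.

    atoms, frag_atoms: lists of element labels; atom numbers are 1-based.
    bonds, frag_bonds: lists of (from_atom, to_atom, order) tuples.
    abbrev_idx: 1-based number of the abbreviated atom (e.g. "Bu").
    Returns (new_atoms, new_bonds) with both placeholders removed.
    """
    if frag_atoms[0] != "*":
        raise ValueError("first fragment atom must be the '*' placeholder")

    # Assumed here: exactly one bond joins the abbreviation to the rest.
    attached = [bond for bond in bonds if abbrev_idx in bond[:2]]
    if len(attached) != 1:
        raise ValueError("abbreviation must be bonded to exactly one atom")
    src, dst, _ = attached[0]
    anchor = dst if src == abbrev_idx else src

    # Union of the two atom lists, skipping the abbreviated atom and '*'.
    new_atoms = []
    primary_map = {}
    for number, label in enumerate(atoms, start=1):
        if number != abbrev_idx:
            new_atoms.append(label)
            primary_map[number] = len(new_atoms)
    frag_map = {}
    for number, label in enumerate(frag_atoms, start=1):
        if number != 1:
            new_atoms.append(label)
            frag_map[number] = len(new_atoms)

    # Bonds that touched '*' are redirected to the anchor atom.
    frag_map[1] = primary_map[anchor]

    # Keep every primary bond except the one to the abbreviation, then
    # add all fragment bonds with their original order and direction.
    new_bonds = [(primary_map[a], primary_map[b], order)
                 for a, b, order in bonds if abbrev_idx not in (a, b)]
    new_bonds += [(frag_map[a], frag_map[b], order)
                  for a, b, order in frag_bonds]
    return new_atoms, new_bonds
```

Because every bond from the placeholder is carried over, a fragment whose `*` atom has two bonds yields two bonds to the anchor atom after expansion, without any special handling.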
As a general rule, the inline abbreviations should contain quality structure coordinates, which means that an alignment of the analogous placeholder atoms by rotation and translation, as suggested in the diagram above, should produce a reasonably readable structure depiction, at least without taking into account congestion. Abbreviations are defined to be strictly terminal, i.e. the atoms of the inline group may be connected to only one of the atoms in the main structure. However, the attachment point may be connected by multiple bonds to atoms within the abbreviation itself. Consider a metal centre bonded to an acetylacetonato ligand:
This coordination geometry can be expressed in terms of an abbreviation (acac), and its definition:
Note that the inline abbreviation has one placeholder atom, *, which is bonded to both of the oxygen atoms of the ligand. In the primary structure representation, the acac abbreviation is connected to the copper atom by one connection point. The fully expanded structure is fully equivalent to the abbreviated version, because when the abbreviation is expanded, the Cu-acac bond is deleted, and the two bonds from the inline abbreviation (one single, one zero-order) are used to replace it. In this way molecular formula, hydrogen atom calculations and valence counting are all preserved.

Comments
Some notes on the nuances of the SketchEl molecule format are described below.

Bond orders
Nonzero bond orders are considered to carry significant meaning, and provide a fairly strong hint as to the pi localisation of electrons within the molecular structure, and resonance patterns thereof. Most organic species can be represented by using bond orders 1 through 3. In the vastness of chemistry, however, most possible structures have some number of bonds for which this assignment is misleading, and so a bond order of 0 should be used to denote a bonding interaction of indeterminate degree. Dative bonds, strong hydrogen bonds and multicentre bonds all qualify. This is a particularly useful interpretation hint for metal-organic structures, where part of the molecule has conventional bonding patterns (usually the organic part), while the interface (typically to the metal) would invalidate normal valence rules.
Consider the fictional platinum complex above. If all of the connections were represented using double or single bonds, an interpretation of the bonding patterns would have to conclude that not all of the atoms are valid Lewis structures, and none of the bond orders have useful meaning. With selected use of zero-order (dotted) bonds, however, considerable information about the structure falls into shape quite easily.

The dimethylamine ligand has a coordination bond to the platinum metal, which means that it is reasonable to guess that the number of hydrogen atoms on the nitrogen atom is 1, which is correct. If the coordination bond were drawn as a single line, the ligand would be considered to be an anionic ligand, with no hydrogen atom. On the other side, the pyridine-platinum connection is also described as a zero-order bond, which preserves the valence of the aromatic system. The pyridine ring can readily be interpreted as a relatively normal 6-ring aromatic system, rather than a hypervalent non-octet species.

The platinum centre has four ligands, two of which are zero order, and two of which are single bonds. Combined with no overall charge, this suggests that the oxidation state is that of Pt(II), which is the commonly accepted designation. On the other side of the pyridine substituent, a hydroxy substituent is shown in a chelated H-bond arrangement with the adjacent ketone. Use of a zero-order bond allows this to be expressed, without drawing a divalent hydrogen atom, and thereby upsetting the otherwise neat and tidy valences.

Stereochemistry
All stereochemical features are represented by the atom position and bond style. There are no additional fields for chirality. Parity-style assignments, such as the CIP R/S or E/Z systems, must be calculated as needed. Part of the reason for this is that all sketches are considered to be potentially a work-in-progress. Assigning a definitive parity to an atom or bond loses meaning as the molecule is modified, permuted, rotated etc. For sketching purposes, it is far more practical to recalculate these properties from the sketch, and never encode them as fixed values.

For stereochemistry which is unknown, or mixed, the unknown bond type can be used. For chiral centres, the mere absence of inclined or declined "wedge bonds" is sufficient. The format draws no distinction between unresolved stereochemistry and mixtures. There is no inherent capability for describing multiple species within a single structure, whether it be stereochemistry, tautomers, isomers, or any other kind of 'mer. This is a deliberate design decision. Systems which try to encode a large amount of chemistry within a single sketch are inevitably either too complex, or too limiting. Mixtures of distinctly different molecular species should be drawn as individual sketches, which has the benefit of being specific, simple and foolproof, at the expense of convenience.

Implicit hydrogen atoms
One of the most problematic side effects of the most popular molecular sketch formats is the inability to reliably reconcile the drawing with the corresponding molecular formula, due to over-reliance on automatic calculation of the number of implicit hydrogens attached to the heavy atoms which are explicitly drawn as part of the sketch. For most organic compounds, the number of implied hydrogens is quite simple to calculate, since most of the constituents are first row p-block atoms, and the valences are mostly Lewis-octet based, and it is usually obvious when this is not the case. However, in the absence of a way to mark an atom as not being eligible for automatic implicit hydrogens, problem cases quickly rack up when leaving this comfort zone. For a good illustration, consider the following two sketches:
Both of these compounds show a tin atom connected by single bonds to two substituents. The dimethyl tin compound on the left could be dimethyl tin(II), or it could be dimethyl tin(IV) dihydride. Since the former is an extremely reactive intermediate, and the latter is commercially available, it would be quite reasonable to guess that two implicit hydrogen atoms should be added to top up the valence to make a tin(IV) compound. The dichloro tin compound on the right, however, is more likely to be tin(II) chloride, i.e. adding two implicit hydrogens would most likely be a mistake. However, no matter how good the algorithm is for picking the most likely state, whether to add or not to add is still just a guess, and the author of the sketch may have intended to represent the other possibility. This is clearly unacceptable, since authors of sketches generally know how many hydrogens they want on each of their atoms. Being unable to correctly recalculate the molecular formula of the input structure is a case of unnecessary information loss.

The SketchEl molecule format does not prescribe a method for calculating implicit hydrogens, although the formula that it uses internally is conservative. Rather, it has two states:
The actual hydrogen-count formulae used by MMDS (and SketchEl) are:
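As an illustration only, the counting rules discussed in this section can be sketched in Python. This is not the MMDS or SketchEl source. The function name `auto_hydrogens`, the contents of the `BASE_VALENCE` table (which five elements are listed and their base valences) and the use of the signed charge for elements other than carbon are assumptions. The absolute-charge rule for carbon, the subtraction of unpaired electrons and of the literal bond order sum, and the clamp at zero follow the text.

```python
# ASSUMPTION: the five eligible elements and their base valences are not
# given in the surrounding text; these are the usual neutral valences.
BASE_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "P": 3}


def auto_hydrogens(element, charge, unpaired, bond_orders):
    """Return the automatic implicit hydrogen count for a single atom.

    element: element label, e.g. "C".
    charge: formal charge on the atom.
    unpaired: number of unpaired electrons on the atom.
    bond_orders: orders of all bonds to this atom (0 for zero-order bonds).
    """
    base = BASE_VALENCE.get(element)
    if base is None:
        return 0  # elements outside the table never get automatic hydrogens

    # Carbon uses the absolute value of the charge. ASSUMPTION: the other
    # elements use the signed charge (e.g. N+ can carry one more hydrogen).
    adjusted = base - abs(charge) if element == "C" else base + charge

    # Zero-order bonds add 0 to the sum, so they do not affect the count.
    count = adjusted - unpaired - sum(bond_orders)
    return max(count, 0)  # negative results are set to zero
```

Under these assumptions the sketch reproduces the values quoted earlier: `auto_hydrogens("C", 1, 0, [1])` gives 2 for the cationic terminal carbon, `auto_hydrogens("O", 0, 1, [1])` gives 0 for the radical oxygen, and a nitrogen with two single bonds and one zero-order bond, `auto_hydrogens("N", 0, 0, [1, 1, 0])`, gives 1.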
Any element not in the list has an automatic implicit hydrogen count of zero. Calculated values of less than zero are set to zero. Note that carbon atoms use the absolute value of the atom charge. The bond order sum is the total bond order of all connected neighbours, and is used literally, i.e. a bond order of 0 has no impact on the hydrogen counting.

Only five elements are eligible for receiving automatic hydrogen atoms. The remainder of the periodic table consists mainly of atoms to which an attached hydrogen atom is a notable feature. A neutral terminal carbon atom with 3 hydrogens (methyl) is very uninteresting, and sketches are more legible if they are not shown, and more likely to be correct if the user is not required to provide this information. A tin(IV) centre with two methyl groups and two hydridic hydrogen substituents should display its metal-hydrogen substituents proudly, and requires additional specification to confirm that they are intentional.

In practice, automatic hydrogen calculation is very useful when sketching molecules, since most atoms are either part of a Lewis-compliant fragment, or are atoms for which implicit hydrogens are never automatically added. The explicit hydrogen field can either be used by setting to 0 and drawing hydrogens as actual atoms, or it can be used to force a certain number to be associated with the atom, and drawn as part of the label when displayed.

One of the functional requirements of the format is that the list of atoms plus the number of implicit hydrogens recorded in the atom blocks of the format must make up the entire molecular formula. Some or all of the hydrogen atom counts will usually be calculated automatically, but it is the author's task to ensure that this is overridden manually whenever this would have led to the wrong answer.
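The requirement that the atom list plus the recorded hydrogen counts make up the entire molecular formula can be illustrated with a short tally. This is a minimal sketch, assuming each atom has already been resolved to an (element, hydrogen count) pair, whether the count was calculated automatically or set explicitly. The function name `molecular_formula` and the Hill-order output are illustrative choices and are not part of the format.

```python
from collections import Counter


def molecular_formula(atoms):
    """Build a formula string from (element, hydrogens) pairs.

    atoms: iterable of (element_label, resolved_hydrogen_count) tuples.
    Hydrogens drawn as actual atoms appear as ("H", 0) entries.
    """
    counts = Counter()
    for element, hydrogens in atoms:
        counts[element] += 1
        counts["H"] += hydrogens
    if counts["H"] == 0:
        del counts["H"]  # no hydrogens at all: leave H out of the formula

    # Hill order: C first, then H, then the rest alphabetically; if there
    # is no carbon, every element (including H) is sorted alphabetically.
    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)
```

For the dimethyl tin example, the two readings differ only in the hydrogen count recorded on the tin atom: `[("C", 3), ("Sn", 0), ("C", 3)]` gives `C2H6Sn`, while `[("C", 3), ("Sn", 2), ("C", 3)]` gives `C2H8Sn`. Nothing has to be guessed at formula time, because the count is part of the atom record.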