by Dr. Alex M. Clark
The field of cheminformatics has file formats which have come to be accepted as de facto industry standard methods for exchanging diagrams of chemical structures, and associated data. Commonly called MDL MOL (or sometimes molfile), for single molecules, and MDL SDF (or sometimes SDfile) for multiple molecules and data, these are both described in detail in the CTfile Formats reference document.
The MDL MOL file format consists of a Connection Table (CTAB) block with a three line header. A very simple example is a diagram of acetone:
The format is simple enough, when used to represent a simple molecule. After the 3 lines of optional padding is the counts line, which most importantly specifies the number of atoms and bonds that follow. Each atom and bond has one line, and uses Fortran-style fixed width formatting to encode its properties. Reading or writing a simple MDL MOL file can be accomplished easily with any programming language.
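The fixed-width layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a complete implementation: the function names and the acetone coordinates are invented for the example, only the element symbol, 2D position and bond order are handled, and every other column is written as zero.

```python
# Hypothetical acetone sketch: (symbol, x, y) atoms and (from, to, order) bonds.
ACETONE_ATOMS = [("C", 0.0, 0.0), ("C", 1.299, 0.75), ("O", 1.299, 2.25), ("C", 2.598, 0.0)]
ACETONE_BONDS = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]


def write_molfile(name, atoms, bonds):
    """Write a bare-bones V2000 molfile; unused columns are filled with zeros."""
    lines = [name, "", ""]  # three header lines: name, program/timestamp, comment
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3} 0" + "  0" * 11)
    for a, b, order in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    return "\n".join(lines) + "\n"


def read_molfile(text):
    """Read back the same subset: name, atom symbols and 2D positions, bonds."""
    lines = text.split("\n")
    name = lines[0]
    natoms, nbonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [
        (line[31:34].strip(), float(line[0:10]), float(line[10:20]))
        for line in lines[4 : 4 + natoms]
    ]
    bonds = [
        (int(line[0:3]), int(line[3:6]), int(line[6:9]))
        for line in lines[4 + natoms : 4 + natoms + nbonds]
    ]
    return name, atoms, bonds


text = write_molfile("acetone", ACETONE_ATOMS, ACETONE_BONDS)
print(text)
print(read_molfile(text))
```

Everything is addressed by column position rather than by splitting on whitespace, because adjacent fields are allowed to run together when the numbers are wide enough.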
There is a lot more that goes into an MDL MOL file, once the content gets complicated. And there are also some serious limitations, many of which do not have viable workarounds. These will be discussed in detail presently.
The MDL SDF format is a stream-like extension to the MDL MOL file. After each molecular structure is described, some number of associated data fields can follow, until the record is terminated. A single entry for acetone might be as follows:
As long as each record is terminated with '$$$$', any number of records can be stacked sequentially in a file or stream. There are no more records when the stream reading operation indicates EOF (end-of-file). Composing an SD file from a collection of MOL files and text-friendly data is very simple. Manipulating SD files is also very simple if the amount of interpretation required is low (as long as none of the gotchas are triggered), since the records can easily be split up, modified, and put back together with only a modicum of text-parsing effort.
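Splitting a stream into records on the '$$$$' terminator might look like the following sketch. The function name is invented for illustration, and tolerating a missing terminator on the last record is a design choice of this example rather than something the specification requires.

```python
import io


def iter_sdf_records(stream):
    """Yield each record as a list of lines, split on the '$$$$' terminator."""
    record = []
    for line in stream:
        line = line.rstrip("\r\n")
        if line.startswith("$$$$"):
            yield record
            record = []
        else:
            record.append(line)
    # Assumption: a trailing record without a terminator is still worth returning.
    if any(item.strip() for item in record):
        yield record


stream = io.StringIO("mol A\n> <ID>\n1\n\n$$$$\nmol B\n$$$$\n")
for record in iter_sdf_records(stream):
    print(record)
```

Because the generator only holds one record at a time, it works equally well on a file and on a network stream of unknown length.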
The MDL MOL and MDL SDF formats have been in use for a long time, and arguably predate the personal computer. The programs that have been written to be aware of either or both of these formats are innumerable, since the set includes monolithic modelling packages, single-use scripts, and everything in between. The ubiquity of these formats is such that it would generally be foolish for any cheminformatics software to neglect to provide an adequate ability to read and write them.
Although the need to interconvert with these formats is not in question, there are a number of reasons why these formats are not ideal for use as the default native format by modern software. The remainder of this article attempts to summarise these shortcomings.
It should be mentioned that there is no intention to be disrespectful of these formats, or anybody who has had anything to do with them! Software evolves. Priorities change. Short term needs override long term strategy. Ideas outlive their original life expectancy. The MDL MOL and MDL SDF formats will still be with us 20 years from now, so it makes sense to at least be aware of what they can and can't do.
The following issues are likely to be encountered with the single-molecule MDL MOL format.
No recognition prefix
There is no "special code" at the beginning of the file that indicates the format. This is usually not such a big deal when reading from a file (e.g. something.mol) because the file extension typically provides enough of a clue. Other cases, such as storage in an SQL field that is used solely for molecules of the same type, also do not suffer because their type has been established by an existing contract.
Reading an MDL MOL file from a stream of unknown type is a situation that occurs frequently enough. For example, when pasting arbitrary text from the clipboard into a sketcher: the format must be guessed and handled appropriately. It could be unformatted text; it could be a SMILES string; it could be a MOL file where the first line is a SMILES string used as the molecule name; or it could be an SD file, which is initially indistinguishable.
There is one part of the file that is relatively consistent, on line 4:
The V2000 tag is supposed to reside at the exact column for all files. While most cases do abide by this rule, there are files which lack this tag, or put it in the wrong place. There is also a newer, less-common update, which uses V3000 as the identifier. In practice it is generally a good idea to verify as much of the counts line as possible, such as parsing the number of atoms and bonds and making sure they are plausible. It is useful to read ahead in the stream and look for end-of-file,
Distinguishing a MOL file from an unrelated format can be done with reasonably high reliability, but it is inconvenient. To do it in a fault-tolerant way, so that as many files as possible can be read, most of the parsing work has to be done before the final classification can be made. Ascertaining the difference between MDL MOL and MDL SDF is significantly more painful in the absence of some other clue (such as the file extension, .mol vs .sdf), because both file types start in exactly the same way. A stream reader has to read ahead and perform tentative parsing, looking for a separator tag or other evidence that the file is in fact some other type.
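A tolerant check of this kind could be sketched as follows. The function name and the sample text are made up for the example, and the plausibility tests (non-negative counts, enough lines remaining, parsable coordinates) are one reasonable choice among many. The version tag is searched for anywhere on the counts line, since it cannot be relied on to sit in the right column.

```python
def sniff_molfile(text):
    """Return 'V2000' or 'V3000' if text plausibly starts with a molfile, else None."""
    lines = text.splitlines()
    if len(lines) < 4:
        return None
    counts = lines[3]
    if "V3000" in counts:
        return "V3000"  # the real counts live in the 'M  V30' lines; not checked here
    try:
        natoms, nbonds = int(counts[0:3]), int(counts[3:6])
    except ValueError:
        return None
    if natoms < 0 or nbonds < 0 or natoms + nbonds > len(lines) - 4:
        return None
    for line in lines[4 : 4 + natoms]:
        try:
            float(line[0:10]), float(line[10:20]), float(line[20:30])
        except ValueError:
            return None
    return "V2000"  # accepted even if the tag itself is missing


SAMPLE = "\n".join([
    "", "", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

print(sniff_molfile(SAMPLE))
print(sniff_molfile("CC(=O)C"))
```

Note how much of the file has to be examined before an answer can be given, and that this still says nothing about whether the text is a single molecule or the first record of an SD file.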
The molecule name is very frequently blank, which means that a large proportion of MDL MOL files start out with a blank line. This is quite inconvenient for certain purposes, e.g. pasting the text into an email, and having the recipient copy out the relevant part. Programmers who are unfamiliar with the format also waste time with the classic "rookie mistake" of thinking that the initial blank line is unnecessary padding.
Embedding MDL MOL content within other formats that take liberties with whitespace (such as XML, which is liable to be "pretty printed") has the potential to corrupt the data, although XML has the CDATA directive to guard against this.
Awkward molecule name
The first line is reserved for the molecule name. Despite being blank much of the time, it is also a mismatched concept. Providing a molecule name with a structure only makes sense under circumstances where the molecule is considered to be a distinct immutable thing that can be associated with data that refers to the whole structure. This means that whenever the molecular structure is used for something that is more like a fragment (for example when it is loaded into any kind of editor), it no longer makes sense to associate name and molecule. Typically what happens is that programs that are designed to manipulate structures throw away the name; those which are designed to treat them as invariable objects keep them; and some programs try to balance both objectives, and abandon the name only under certain circumstances.
Storing a name with a structure is a convenience feature, but its usage is poorly defined, which makes it more trouble than it is worth. It would have been better to insist that molecule names have to be in some auxiliary field, such as are available when using MDL SDF, which is a meaningful association.
There are two major types of MDL MOL file specifications: V2000 and V3000. V3000 is a superset of V2000. The difference can be ascertained by examining the counts line. The V3000 format has a number of additional features, such as enhanced stereochemistry specification, but the most profound difference is that there is no limit to the number of atoms and bonds in a molecule. The V2000 files allow up to 999 of each, because the counts are stored in three-character fields.
Because 999 atoms is large enough for most molecular diagram sketches, and the improvements offered by V3000 are mostly rather obscure, the incentive to update software to be able to read and write the V3000 format is not great. The V3000 format extension has a very different style from the column-based V2000 fields, and writing a minimally compliant parser takes more effort. For this reason anybody writing software to output MDL MOL files would be advised to use V2000 whenever the additional features are not required. When this is not possible (e.g. more than 999 atoms), it is necessary to switch to the later format, and just hope for the best.
The files do not have a very high information-to-byte ratio. For example, the line making up a single atom always uses up 69 ASCII characters, plus the newline separator:
This column-regular style made sense in an earlier age when FORTRAN was the language of choice for scientists, because the text parsing tools in that language were so crude that the coding burden overwhelmed other considerations. That was true in the 1970s, but it has not been the case for a very long time. The same information, written in a comma-delimited style and assuming zero for unspecified properties, could be encoded as:
which uses a mere 13 characters, and is very easy to parse. While this point may seem pedantic in this age of overly powerful desktop computers, the reality is that MDL MOL is often the format of choice used for storing collections of molecules numbering in the millions. Using a format which lacks compactness, with no corresponding improvement in ease of use, places an unnecessary burden on large scale cheminformatics techniques, particularly disk space, network I/O and CPU time wasted scanning through all the extra bytes.
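The size difference is easy to reproduce. In the sketch below, the fixed-width function follows the V2000 atom line layout, while the comma-delimited function is a hypothetical encoding invented for this comparison. It is not an existing format, and the coordinates are arbitrary.

```python
def fixed_width_atom(symbol, x, y, z=0.0):
    """One V2000 atom line: three coordinates, symbol, then zero-filled columns."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11


def compact_atom(symbol, x, y):
    """Hypothetical comma-delimited form; anything left out is assumed to be zero."""
    return f"{symbol},{x:g},{y:g}"


fixed = fixed_width_atom("C", 1.299, 0.75)
compact = compact_atom("C", 1.299, 0.75)
print(repr(fixed), len(fixed))
print(repr(compact), len(compact))
```

Multiplied over a collection of millions of molecules, each with dozens of atoms, the difference in bytes read and scanned becomes substantial.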
Note that proposing an XML alternative to MDL MOL would likely have a similar level of verbosity.
There is a lot of wasted space in an MDL MOL file, and many of the hardcoded fields are only used for queries, which are not valid for structures that describe existing molecular species, and it is tempting to slip in extra data. There are also a number of documented optional annotations that are permitted in the "M-block" (such as
Unfortunately the format does not have any forward compatibility provisions. In the worst case scenario, an external program which does not acknowledge your extensions would refuse to read the file at all when it finds something that is not listed in the specification. Most other software will simply ignore fields that it does not understand. Some programs make an effort to modify the content as little as possible, and may preserve additional data buried in the atom and bond rows, or have unrecognised M-block tags come along for the ride during the read/modify/write cycle. This is likely to have unintended consequences, however, because most M-block tags use atom or bond indices to specify which part of the molecule they apply to. If the atom or bond indices are modified, the software will not know how to update the custom tag, and so the output will be invalid.
For most practical purposes, the MDL MOL format cannot be extended. The properties described in the official specification should be used exclusively, and only for the purposes they were intended.
Too few bond types
Bond orders are restricted to being single, double or triple. The format was designed to represent pure organic structures which conform to the Lewis octet rule and conform to the simplest notions of covalent bonding by sharing pairs of electrons.
Outside the realm of molecules composed from the common organic subset, there are many more types of bonds, a great many of which are hard to describe - it is not always entirely unambiguous how many electrons are involved, or which atoms are involved. One thing is clear though: the single/double/triple classification system is inadequate. When assignment of bond order becomes complicated, the de facto standard for MDL MOL is to describe the bond as single, which is reasonable when producing a graphical diagram of the sketch, because it will be rendered as a single line. Unfortunately this can make a mess of hydrogen counting, valence assignment and formal charge. The addition of bonds with an order of 0 would alleviate this problem significantly, but this value is disallowed, and there is no way to work around it, which means that most inorganic, organometallic and non-Lewis conformant molecules cannot be represented in a meaningful way.
Implicit hydrogen control
One of the biggest problems with the format happens when stepping outside of the realm of simple organic compounds, at which point it can become difficult or impossible to draw a molecule in a way that correctly implies its molecular formula.
The format was originally designed to capture 2D sketches of molecules, such as those drawn on paper since the mid 19th century. Chemists have long used a shorthand notation which involves not writing the letter "C" for carbon, and not explicitly drawing all of the hydrogens, because it is usually obvious. For example, a terminal methyl group is CH3. This is understood. If it were any other way, it would be interesting, and it would be clearly indicated. Drawing all explicit hydrogen atoms all the time is no more fun on a computer than it is with a pen and paper, and popular software reflects this preference.
The problem with the MDL MOL format is that it assumes that it is always possible to calculate the number of hydrogens implicitly attached to any atom, reliably, with a simple formula based on valence environment. It must be assumed that all atoms will be "topped up" with some computed number of hydrogen atoms.
A simple example can be illustrated by a tale of two tins, drawn in shorthand without any explicit hydrogens:
Tin is in group 14, which means it has a Lewis valence of 4. On the left is tin with two methyl substituents, each of which uses up one valence point, and because there is no charge, the formula would suggest that 2 hydrogens should be added to make a whole molecule. In the absence of further information, this is probably correct. Organostannanes are typically most stable when their composition follows the octet rule. On the right is tin with two chloro substituents. The same valence formula would suggest that there are two extra hydrogen atoms attached to the tin atom, but this is probably wrong, because tin(II) chloride is a stable, commercially available material which is sold in a plastic bucket and used in undergraduate laboratories. The molecule with two extra hydrogen atoms is very reactive, and is probably not what was intended by the sketch.
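The simple rule described here can be written down directly. The sketch below uses a small hypothetical valence table and handles only neutral atoms. It illustrates the "top up to the default valence" idea from the text, and is not the rule set of any particular toolkit.

```python
# Hypothetical default valences for a handful of neutral elements.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2, "Cl": 1, "Sn": 4}


def implicit_hydrogens(symbol, bond_orders):
    """Top a neutral atom up to its default valence with implied hydrogens."""
    return max(0, DEFAULT_VALENCE[symbol] - sum(bond_orders))


# Tin with two methyl substituents: two single bonds.
print("dimethyl tin:", implicit_hydrogens("Sn", [1, 1]))
# Tin with two chloro substituents: also two single bonds.
print("tin dichloride:", implicit_hydrogens("Sn", [1, 1]))
```

Both tin atoms present the same input to the formula, so both receive the same number of hydrogens. Nothing in the valence environment distinguishes the case where the extra hydrogens are wanted from the case where they are not.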
What to do about this? There are a few options, all of which are unsatisfactory. There is a field in the MDL MOL file for controlling implicit hydrogen count, but it is reserved for query structures, and using it to describe actual structures is invalid, and will break many programs.
Two ways to encode chirality
The format has two ways to specify atom-centred tetrahedral chirality, and they are not harmonious. Chemists are accustomed to representing chiral centres with wedges to denote that a bond is oriented "up" or "down" relative to the page. This convention is well understood, and when used properly, there is sufficient information to calculate other labels, such as the R/S assignments used by the CIP system. The format provides properties for annotating the bonds with wedges, but it also has a parity code that can be assigned to individual atoms, as long as they have either 3 or 4 connected neighbours. The parity system is like R/S, except that it uses atom order as its priority system.
For a molecular structure that does not need to be modified any further, these two methods for specifying chirality can both be used, and as long as they are in agreement, either of them is available for further interpretation. The problems start when they do not agree. It is not clear which has precedence: the user-drawn sketch with wedges, or the calculated parity value. For practical purposes it depends. A molecule can be encoded in the MDL MOL format when it does not have 2D sketch coordinates (e.g. 3D structures, or structures where there is no atom position information at all); if this is the case, then the parity values can correctly encode the chirality, while any wedge bonds are rendered meaningless. However, if the structure is a sketch, then the user-defined layout and wedges are most likely what the user intended to represent. If the parity is missing or inconsistent, it should probably be recalculated, but if there is any doubt, it is not clear how they should be reconciled.
It is tempting to use the parity system as the definitive reference point, but this is not practical for structures if they are being modified. Any changes to the system, e.g. adding, deleting or reordering atoms, or even moving their positions, will require that the parity be updated, and it must be known whether the editing operation was supposed to preserve or invert the stereochemistry. And if an atom started out with 3 or 4 neighbours, but gets disconnected so it has 2 neighbours or fewer, then it is no longer possible to assign a parity, because the atom cannot be a chiral centre. If the bonds are added back in, the chirality information will be lost.
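The dependence on atom order can be illustrated with a small sketch. It assumes, purely for the example, that the geometry of a stereocentre is given as a list of four neighbours in which the first three run clockwise when the fourth points away from the viewer. The parity code is then 1 if sorting the neighbours by atom index is an even permutation of that list, and 2 if it is odd. The function names and atom labels are hypothetical.

```python
def permutation_is_even(sequence):
    """Count inversions; an even count means an even permutation."""
    inversions = sum(
        1
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence))
        if sequence[i] > sequence[j]
    )
    return inversions % 2 == 0


def parity_code(neighbours_clockwise, index_of):
    """Parity (1 or 2) of a fixed geometry under a given atom numbering."""
    indices = [index_of[label] for label in neighbours_clockwise]
    return 1 if permutation_is_even(indices) else 2


geometry = ["F", "Cl", "Br", "H"]  # the same physical arrangement throughout
original = {"F": 1, "Cl": 2, "Br": 3, "H": 4}
renumbered = {"F": 2, "Cl": 1, "Br": 3, "H": 4}  # two atoms swapped in the file

print(parity_code(geometry, original))
print(parity_code(geometry, renumbered))
```

The molecule has not changed between the two calls, only the order in which its atoms are listed, yet the stored parity value has to change. Any software that reorders atoms without recomputing parity silently inverts the stereocentre.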
The relationship between wedges and parity is not well defined, and must be handled based on the circumstances. The format provides no assistance, which necessitates guesswork, and forces cheminformatics software to choose between generality and robustness, but not both.
One way to encode double bond stereochemistry
Unlike chirality, stereochemistry about restricted-rotation double bonds, such as alkenes, is represented only by the coordinates of the atoms. The historical reason for this is that it is very easy to draw cis/trans (or E/Z) isomers of a double bond. When coordinates are always valid, this is not a serious problem, but it does mean that if atom positions are not reliable for any reason, then all double bond stereochemistry is lost. This is in contrast to chirality, which can preserve this information by way of its parity field.
Because of this, any structure with double bond stereochemistry must always have coordinates which resolve the stereochemistry correctly. This limitation may seem obtuse, but there are cases where cheminformatics algorithms would be much easier to implement if it were possible to discard the atom positions and recreate them later, without losing stereochemistry.
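Recovering the geometry from coordinates amounts to checking which side of the double bond each substituent lies on. The sketch below is a minimal 2D version with invented function names: it takes one substituent on each end of the bond and compares the signs of two cross products.

```python
def cross(origin, a, b):
    """Z component of the cross product of (a - origin) and (b - origin)."""
    return (a[0] - origin[0]) * (b[1] - origin[1]) - (a[1] - origin[1]) * (b[0] - origin[0])


def double_bond_geometry(p, a, b, q, tolerance=1e-6):
    """Classify p-a=b-q from 2D positions as 'cis', 'trans' or 'undefined'."""
    side_p = cross(a, b, p)
    side_q = cross(a, b, q)
    if abs(side_p) < tolerance or abs(side_q) < tolerance:
        return "undefined"  # collinear or missing coordinates
    return "cis" if side_p * side_q > 0 else "trans"


a, b = (0.0, 0.0), (1.0, 0.0)
print(double_bond_geometry((-0.5, 0.87), a, b, (1.5, 0.87)))
print(double_bond_geometry((-0.5, 0.87), a, b, (1.5, -0.87)))
print(double_bond_geometry((0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)))
```

The last call shows the failure mode: once the coordinates have been discarded or zeroed out, there is nothing left from which the stereochemistry could be recovered.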
Many unused fields
The documented format for MDL MOL files is quite long. A great many of the officially supported fields, and permitted values, were not designed for the purpose of creating a general purpose industry standard molecule encoding scheme, but rather to store data used by specific programs. Most software has no reason to examine more than a certain useful subset, which is pragmatic enough, but it means that there are very few (if any) programs which have a complete implementation of the format, because most of the fields are irrelevant. The definition of relevance is not necessarily universally agreed upon.
It would have been much more workable had the format been partitioned into core features, which all compliant software should be able to manipulate correctly, and program-specific extensions, but as it stands, that judgment call has to be made by each cheminformatician.
The following issues are likely to be encountered with the multiple-molecule MDL SDF format.
No stream identifiers
Because the stream typically starts with an MDL MOL and builds on it, determining that a stream of bytes is actually using the MDL SDF format is not entirely simple, if the format is not known for sure ahead of time. The problem is not so acute when reading a file, because the file extension usually provides a good clue (i.e. .sdf).
Confirming the stream type typically involves reading the first record, up to the separator tag, '$$$$'. This is problematic, though, because a single record can in principle be quite huge. In the diabolical case, the first record could have a base64-encoded field thousands of lines long. The stream reader would have to read into a large buffer, and at some point make a judgment call, at the risk of rejecting a legitimate stream by stopping too soon, but also risking the possibility that the stream was of a different type, requiring the operation to roll back and try something else.
To write robust and general software, the read-ahead operation should look for the separator tag, but also parse the first record as it goes, looking for a reason to invalidate the stream-type hypothesis.
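A bounded read-ahead along these lines might be sketched as follows. The function name, the verdict strings and the default line limit are all assumptions made for the example, and the only invalidation test shown (a NUL character, suggesting binary data) stands in for whatever record-level checks a real reader would apply.

```python
import io


def sniff_sdf_stream(stream, max_lines=10000):
    """Read ahead until '$$$$', a reason to reject, or the line limit is hit.

    Returns a verdict and the lines consumed, so the caller can replay them.
    """
    buffered = []
    for line in stream:
        buffered.append(line)
        if "\x00" in line:
            return "rejected", buffered  # looks like binary data
        if line.startswith("$$$$"):
            return "sdf", buffered
        if len(buffered) >= max_lines:
            return "undecided", buffered  # judgment call: the record is too long
    return "no-terminator", buffered  # end of stream; could be a lone MOL file


verdict, lines = sniff_sdf_stream(io.StringIO("name\n\n\n$$$$\nnext\n"))
print(verdict, len(lines))
```

Returning the consumed lines along with the verdict is what makes the roll-back possible: if the hypothesis fails, the same text can be handed to the next candidate parser.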
No size expectation
When the data is being read from a stream of indeterminate size, there is no information about how many records should be expected. In fact there is also no termination signifier. Software designed to read MDL SDF streams must simply keep reading records until an error occurs, or the end-of-file signifier is sent by the stream control mechanism. For some kinds of streams, it is not always convenient to tell the difference between actual end of stream, and a temporarily empty buffer, and so extra safeguards must be added. Most large-scale streams (except those which wrap random-access sources, such as file reading) do not provide an indication of how long the stream is going to be. This means that reading an enormous MDL SDF stream from a network source, and performing processing operations on each record, cannot provide any meaningful estimate of how much progress has been made, unless the number of records has been shared by some other means.
Molecule is optional
In most cases each MDL SDF record starts with an MDL MOL entry, then proceeds to the data. If for some reason there is no molecule for the record, it is valid to omit it entirely. Guarding against this possibility is not too difficult, although it means that if there is a molecule name, it is important that the first character is not ">".
If the first record in the stream has no molecule, it makes stream-type recognition a much more painful chore. Consider a stream that starts with the following valid SD record:
There are far fewer identifiers that confirm the MDL SDF format. The first character being ">" helps, but it is hardly definitive. The MDL SDF format also has a considerable amount of flexibility as to how the field names are organised. The most common style is
In a nutshell, devising a robust, efficient and reliable way of typing any valid MDL SDF stream is actually a significant challenge.
No document header
Because the MDL SDF format is defined as an unrelated collection of molecule/data records concatenated together, with no header or footer, there is nowhere to store document metadata. It is almost always the case that a file containing molecules and data has context information associated with it. There is usually a reason why the molecules are together, and at the very least it would be useful to have a freeform description field, but there is none.
Obviously if the data is stored as a file, then properties such as the filename, date, etc. provide limited metadata, but when used as a stream, this disappears. One possibility is to encode some key information (e.g. title) in the first field, but this is hardly ideal. This corrupts the pure-stream property of the format, which actually has some major advantages, despite the generally negative tone taken by this article. If the first record were to be displaced for any reason, the metadata will be lost.
No datatype header
Probably the most severe problem with the format is the lack of field descriptions. Data stored using the MDL SDF format is almost always extracted from some kind of table-like source, such as a spreadsheet or an SQL table. These sources not only have a limited common set of field names, they are usually also strongly typed (e.g. string, number, flag, etc.).
MDL SDF field data is completely freeform. Any field name can appear anywhere in the SD file, without warning. The content has no particular restrictions. This has two very ugly consequences for algorithms which need to read in the data and turn it back into the tabular form that it (probably) came from:
When data is exported from a typed table to an MDL SDF stream, it would be polite to ensure that all fields are present in the first record, but there is no such convention that can be assumed, especially if the data model allows null values, which can be represented by the absence of a field.
There is no reliable way around the lack of typing. The best solution is to either give up and import fields as text, or read all fields as text and do a post-processing step which tries to add typing information. For example if a field always has values which are integers or floating point values, then such a guess will frequently be right. But text fields can contain numbers, and so the guess can be wrong, especially when applied to subsets of a larger collection of data.
Data which did not originate from a strongly typed source often includes modifiers and units in a haphazard way, e.g. ">10" or "not tested" or "50 g/mol". While the need to indicate non-numeric information with a numeric field does not originate with the MDL SDF specification, this kind of ad hoc almost-numeric data is extremely commonly found in SD files. Because of this, type-guessing has to be done with great care.
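The post-processing step can be sketched as a function that looks at every value seen for a field and picks the narrowest type that fits all of them. The function names and the three type labels are hypothetical, and missing values are represented here as None.

```python
import math


def _is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_real(text):
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False


def guess_type(values):
    """Guess 'integer', 'real' or 'string' from all values seen for one field."""
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "string"  # nothing to go on
    if all(_is_int(v) for v in present):
        return "integer"
    if all(_is_real(v) for v in present):
        return "real"
    return "string"


column = ["5", "7", ">10", "not tested", None]
print(guess_type(column))
print(guess_type(column[:2]))
```

The two calls disagree: the full column is text, but a subset containing only the first two records looks like integers. This is why a guess made from part of a collection cannot be trusted for the whole.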
One molecule per record
Each record can hold one molecule (which may be blank). This is quite sufficient for many uses, but there are plenty of examples where it would be useful to have more (e.g. the molecule before and after some operation). There is no satisfactory way to accomplish this with MDL SDF.
Some possible strategies include splitting up into multiple records, i.e. subsequent records may be associated with previous information. This is probably the best technique for most purposes, as the resulting stream can be run through many types of third party software processes without losing the information, and will likely be somewhat human-readable. Another option is to include molecule data (other than the first one) using some chosen representation, e.g. as a SMILES string, or as a multiline field, either using a subset of the MDL MOL format (the CTAB-block, which contains no blank lines) or base64 encoding. Third party software may be expected to preserve this data, but there is no standard way to make it actually usable.
Data fields can be multiline, and are terminated by a blank line. But there is no way to encode a blank line within the actual text, because it is always a terminator. If there were an "escape syntax", like the C-style use of "\n", then it would be possible to work around this by preprocessing fields before output, and be certain of no information loss. In the following example:
The intervening blank line is considered a terminator half way through the data block. Besides demonstrating that MDL SDF is a poor choice of format for storing nursery rhymes, it means that arbitrary text cannot be encoded in a human readable form. In order to do so, some informal contract would have to be made with third party software, by way of inventing escape sequences. Or the whole block can be stored as base64, which is not human readable, not to mention unsupported by many readers.
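The base64 workaround is straightforward with the Python standard library. In this sketch the function names and the sample text are invented. The encoded form is guaranteed to contain no blank lines, but a reader has to know, by private agreement, that the field needs decoding.

```python
import base64


def encode_field_text(text):
    """Encode arbitrary text as base64 lines that can never contain a blank line."""
    encoded = base64.encodebytes(text.encode("utf-8")).decode("ascii")
    return encoded.splitlines()


def decode_field_text(lines):
    """Reverse encode_field_text."""
    return base64.decodebytes("\n".join(lines).encode("ascii")).decode("utf-8")


verse = "First verse, line one\nFirst verse, line two\n\nSecond verse, line one"
lines = encode_field_text(verse)
print(lines)
print(decode_field_text(lines) == verse)
```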
Because MDL SDF is a very simple and widely available format, it has been used for a great many ad hoc purposes. There are many data files that have been generated without following the specification exactly, and there are many readers that are tolerant of mistakes.
For example, blank fields are sometimes emitted without an intervening blank line. If a molecule has boiling point data but no melting point, it should be encoded as:
If the null was interpreted as a blank string rather than "0 lines of data", it might be written as:
Some writers prefer to generate:
Since these varieties occur so frequently, it ceases to matter much which of them is actually correct. If a user has an SD file that can be read by 9 applications but the 10th one refuses to read it, then the odd-one-out will receive complaints until it is "fixed". This is the manner in which the format is controlled more by its abuses, and unsanctioned workarounds, than the format specification.
General purpose software for reading the MDL SDF format has to be tolerant of widespread abuses, as well as honouring everything that is permitted by the specification. And sometimes these conflict. A particularly common problem is use of the greater-than (">") character in numeric fields, since experimental measurements often provide only a boundary. For example:
For a reader which is designed to be tolerant of all types of input, this is quite confusing, since the second line, which contains the actual data, looks somewhat like it is supposed to be a new field name.
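One heuristic for telling the two apart is to treat a line as a field header only if it starts with ">" and also contains a name in angle brackets. The sketch below takes that approach. The function name, the regular expression and the field names are assumptions of this example, and header styles without a bracketed name are deliberately not handled.

```python
import re

# Assumption: a header is a line starting with '>' that contains '<name>'.
HEADER_RE = re.compile(r"^>.*<([^<>]+)>")


def parse_data_fields(lines):
    """Tolerant parse of the data section of one record into {name: text}."""
    fields = {}
    current = None
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            # A header is accepted even when the previous field was not closed
            # by a blank line, which tolerates a common writer shortcut.
            current = match.group(1)
            fields[current] = []
        elif current is not None:
            if line.strip() == "":
                current = None  # blank line terminates the field
            else:
                fields[current].append(line)
    return {name: "\n".join(text) for name, text in fields.items()}


record = ["> <MP>", "> <BP>", ">100", "", "> <Yield>", ">10", ""]
print(parse_data_fields(record))
```

The price of this tolerance is that a data line which happens to look like a header, with both a leading ">" and a bracketed word, will be misread as the start of a new field.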
For practical purposes, MDL SDF readers that are intended for general use must test against a wide variety of badly formatted source data that is generally vouched for by users of other software. When writing MDL SDF data, it is a very good idea to read the specification and implement it with pedantic attention to detail, and when in doubt, pick the most popular choices.
This article describes a subset of the problems that will be encountered by anyone who needs to use MDL MOL and MDL SDF formats to store static molecular structures and associated data. There are plenty of issues left untouched, e.g. mixing query fields into the molecular structure, or describing polymers, combinatorial libraries, reactions, etc.
There is more than enough justification for ditching these formats when setting out to build cheminformatics software that is reliable, general and a bit more future proof. The best compromise is to try to do as good a job as possible of interconverting between these legacy formats, and to try to lose as little information as possible with each conversion.
The examples and issues raised are strongly influenced by my own personal experiences over the last decade or so. While there are most likely some mistakes or misunderstandings on my part, each case is representative of a real world problem that caused difficulty and slowed down at least one of my software projects at some point, which could have been avoided by use of a better format.
Since this article was originally written, I have published a couple of papers in the peer reviewed literature that address some of these issues: