Summary definition of cheminformatics: the representation, storage, retrieval and manipulation of molecular structures in a way that is both computer- and human-friendly.
Products from Molecular Materials Informatics focus on the domain of cheminformatics, which is an industry insider term for software that handles chemical structures. The field can be loosely understood as including all kinds of software that need to take a conventional picture of a molecule, and represent it in a way that allows the computer to interpret the chemistry.
Chemists are very much at home with the diagram of caffeine, shown in (a). Because it is a common and well known compound, the name caffeine is also meaningful to many chemists, who can recall the corresponding diagram; those who cannot can easily use the name to look it up in a database. To a computer, neither the diagram nor the name has any intrinsic meaning, but a connection-table representation, such as the encoded format given in (c), can be used to describe the raw chemistry. The symbols can be understood by a simple computer program to mean a list of atoms (carbon, nitrogen and oxygen) connected together by bonds. A computer program that is capable of understanding the raw datastructure can be expanded in capabilities so that it can draw its own diagram, and the structure can be associated with other references such as names, literature publications, physical properties, biological effects such as drug target binding, and any other kinds of metadata.
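As a rough illustration of what a connection table conveys, the sketch below encodes the heavy atoms and bonds of caffeine in plain Python and derives the molecular formula from them. The atom numbering, the `DEFAULT_VALENCE` table and the `molecular_formula` helper are assumptions made for this example; this is not the encoded format referred to in (c).

```python
from collections import Counter

# Heavy atoms of caffeine, numbered from 1 (the numbering is arbitrary).
ATOMS = ["C", "N", "C", "N", "C", "C", "C", "O", "N", "C", "O", "N", "C", "C"]

# Bonds as (atom index, atom index, bond order).
BONDS = [
    (1, 2, 1), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2),
    (6, 2, 1), (6, 7, 1), (7, 8, 2), (7, 9, 1), (9, 10, 1),
    (10, 11, 2), (10, 12, 1), (12, 5, 1), (12, 13, 1), (9, 14, 1),
]

# Simplified neutral valences; an assumption that only holds for
# uncharged C, N and O atoms such as the ones in this molecule.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def molecular_formula(atoms, bonds):
    """Derive a formula, filling unused valence with implicit hydrogens."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a - 1] += order
        used[b - 1] += order
    counts = Counter(atoms)
    counts["H"] = sum(DEFAULT_VALENCE[el] - u for el, u in zip(atoms, used))
    # Hill order: carbon, hydrogen, then the remaining elements alphabetically.
    ordered = ["C", "H"] + sorted(el for el in counts if el not in ("C", "H"))
    return "".join(
        f"{el}{counts[el] if counts[el] > 1 else ''}"
        for el in ordered
        if counts[el]
    )


print(molecular_formula(ATOMS, BONDS))
```

Nothing in the lists mentions hydrogen or a formula; both follow from the atoms and bonds alone, which is the kind of interpretation that a picture or a name does not allow.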
This is the basis of cheminformatics: the first step is to capture the molecules using a datastructure that software can work with. Then molecules are associated with higher levels of metadata, associations and scientific measurements. Increasingly sophisticated software can then be brought in, to store the data and make it available for easy retrieval; to turn the data into attractive diagrams suitable for publications; to study how the molecules relate to each other and hence predict as-yet-unmeasured properties; or to use the data as inputs for high level calculations, such as ab initio quantum chemistry or candidates for protein-ligand docking.
One of the most important distinctions to make at the outset is this: the fact that a chemical structure has been digitised and can be displayed by a computer does not mean that the computer can do anything intelligent with it.
To make this point most clearly, consider this very pixelated bitmapped image diagram:
In this case, the chemical structure of caffeine has been served up as a file using the PNG (Portable Network Graphics) format. This representation has no chemical information at all: it is simply a grid of square pixels, each of which has a colour associated with it. There are many things a computer can do with it, such as display it as a small thumbnail, but when it comes to actual chemistry, the meaning has all been lost. While it is stored as a computer file, which you are viewing on your browser right now, for informatics purposes it might as well be a scribble on a paper napkin.
Bitmap files are not the only structure representations that are not part of the cheminformatics domain. Vector graphics formats, while they render much more beautifully on high resolution devices, also typically have no chemistry information associated with them. Whether the content is from a PDF file, a Microsoft Word document or an SVG diagram, chances are the information necessary to allow a software package to understand the chemistry is not there. It is theoretically possible to embed additional data with this content, but it is rare, so chances are any nonproprietary graphical representation of a molecule is as devoid of meaningful content as a low resolution bitmap.
Even data that is stored using file formats designed explicitly for chemistry is not necessarily valid from a cheminformatics standpoint. Consider the following scheme:
This makes a great deal of sense to anyone with chemistry training, and there are any number of software packages designed to help chemists draw diagrams like this. These applications are fantastically useful for creating content for communicating from one scientist to another, e.g. manuscripts, presentations, posters, etc.
The problem, though, is that many of the objects in this scheme have no well defined meaning to a computer algorithm that is trying to determine what the operator is describing. While the atoms and bonds are properly defined, everything else is a black box: the software knows how to display it, but not clearly what it means. One could design a software algorithm that analyses these atoms, bonds and miscellaneous graphical objects and figures out that the scheme describes a class of SN2 nucleophilic substitutions, but that would be of little help, because a slightly different diagram (e.g. a figurative description of an SN1 reaction, or even the same concept expressed in a slightly different style) would likely fool the interpreter into drawing the wrong conclusion.
A molecular structure belongs in the domain of cheminformatics if, and only if, all of the necessary concepts are described using machine-readable objects that have well defined meaning. What this means in practice is that:
- Almost all kinds of pictures are not part of cheminformatics (unless the original data is embedded)
- Most chemistry-specific diagram editing software allows use of drawing objects which disqualify the data
Most of the popular chemical structure drawing applications are actually a superset of cheminformatics: they allow the valid objects to be captured in a meaningful datastructure, but they also provide many other features that are useful for presentation and that, if used, make the information content invalid.
Software that is designed to operate in the domain of cheminformatics must necessarily take a minimalistic approach to describing molecular entities, since everything must have a clear, well defined, well documented and ideally completely unambiguous meaning. For many molecules, this is quite straightforward, because all of the necessary information can be encoded as a list of atoms, each of which corresponds to an element in the periodic table, a handful of annotations about each of these atoms, and a list of bonds connecting them together. Each atom is given a position, which makes it much easier to create a diagram out of the raw content, provides resolution of stereochemistry, and preserves some of the aesthetic nuance that the scientist wished to convey (e.g. orientation).
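A minimal sketch of such a representation might look like the following, assuming hypothetical `Atom`, `Bond` and `Molecule` classes, a deliberately truncated element list, and a `validation_errors` helper that rejects anything without a well defined meaning. None of these names come from an actual product or file format.

```python
from dataclasses import dataclass, field

# Truncated on purpose; a real table would cover the whole periodic table.
KNOWN_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}
ALLOWED_BOND_ORDERS = {1, 2, 3}


@dataclass(frozen=True)
class Atom:
    element: str      # must be a periodic table symbol
    x: float          # 2D position, used for layout and orientation
    y: float
    charge: int = 0   # one example of a per-atom annotation


@dataclass(frozen=True)
class Bond:
    start: int        # 1-based index into Molecule.atoms
    end: int
    order: int = 1


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def validation_errors(mol):
    """Return a list of problems; an empty list means every object is well defined."""
    errors = []
    for i, atom in enumerate(mol.atoms, start=1):
        if atom.element not in KNOWN_ELEMENTS:
            errors.append(f"atom {i}: unknown element {atom.element!r}")
    n = len(mol.atoms)
    for i, bond in enumerate(mol.bonds, start=1):
        in_range = 1 <= bond.start <= n and 1 <= bond.end <= n
        if not in_range or bond.start == bond.end:
            errors.append(f"bond {i}: invalid atom indices")
        if bond.order not in ALLOWED_BOND_ORDERS:
            errors.append(f"bond {i}: unsupported order {bond.order}")
    return errors


# Heavy atoms of ethanol: C-C-O, with illustrative coordinates.
ethanol = Molecule(
    atoms=[Atom("C", 0.0, 0.0), Atom("C", 1.3, 0.75), Atom("O", 2.6, 0.0)],
    bonds=[Bond(1, 2), Bond(2, 3)],
)

# A free-text label such as "R" has no element behind it, and the second
# bond points at an atom that does not exist.
sketchy = Molecule(
    atoms=[Atom("R", 0.0, 0.0), Atom("C", 1.3, 0.75)],
    bonds=[Bond(1, 2), Bond(2, 5)],
)

print(validation_errors(ethanol))
print(validation_errors(sketchy))
```

The point of the design is that there is nowhere to put an object whose meaning is unclear: anything outside the small vocabulary is reported rather than silently stored.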
Cheminformatics software is not restricted to a small set of data objects, but it needs to be designed as layers that are added on top of the fundamentals. By preserving the integrity of the core molecular representation, an incredible amount of scientifically important functionality can be developed:
Individual molecular structures can be annotated in any number of optional ways, whether for aesthetic decoration or with non-essential properties (e.g. partial charges, either measured or calculated). These structures can be fed into algorithms designed to create rendered diagrams, for sharing with humans, using a variety of popular graphics formats. The molecules can be folded into larger datastructures, such as chemical reactions (each of which involves several distinct molecular species), data about the molecule (names, physical properties, etc.), collections of molecules and data, or even higher order structures where molecules are described as component fragments. These additional layers can be combined to form large registration databases, which can be searched using chemically intelligent methods (e.g. substructure or structural similarity), clustered in order to visualise bulk structure-property relationships, or fed into machine learning algorithms that use known numeric properties to calculate missing ones. The same data can also be used to recombine compounds in a multitude of ways, so that software algorithms can leverage their analysis to propose new structures of interest.
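To make the idea of a similarity search concrete, here is a small sketch that ranks a toy library against a query using the Tanimoto coefficient, a commonly used measure of overlap between feature sets. The text above does not name a particular measure, so this choice is an assumption, and the feature labels are hand-picked placeholders rather than real computed fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared features divided by total distinct features."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(a | b)


def rank_by_similarity(query, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical, hand-written feature sets for illustration only.
library = {
    "caffeine": {"purine-core", "amide", "N-methyl", "C=O"},
    "theobromine": {"purine-core", "amide", "N-methyl", "C=O", "N-H"},
    "ethanol": {"hydroxyl", "C-C"},
}
query = {"purine-core", "amide", "N-methyl", "C=O"}

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In practice the feature sets would be generated by an algorithm from the atoms and bonds of each structure, which is only possible when the core representation is machine-readable in the first place.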
To achieve useful results in cheminformatics, it is necessary to practice careful discipline with regard to data integrity and quality, both for the core representation of the structures, and every step along the way. There are numerous ways to lose integrity: input data formats that are insufficiently well designed, lax data entry standards or promiscuous data importing policies, and algorithmic transformations that are not lossless. A good analogy can be found in audio technology: if you take a scratched record, or play a good record on a poor quality turntable, and transcribe the music to a cassette tape, then make some mix tapes by tape-to-tape copying, it will take surprisingly few steps before the results are painful to listen to. By contrast, a music transfer system that is purely digital can sidestep most of these problems: if you rip a compact disc to a suitably high resolution format, you can copy, share and remix it as many times as you like, and only the most scrupulously fussy music nerd will be able to tell the difference.
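One simple way to check whether a transformation is lossless is a round-trip test: encode the structure, decode it again, and compare the result with the original. The sketch below assumes a hypothetical dictionary-based molecule and uses JSON purely as a stand-in for a storage format; `lossy_export` is a contrived example of a step that silently discards information.

```python
import json


def to_json(mol):
    """Encode the molecule as JSON text (stand-in for a well designed format)."""
    return json.dumps(mol, sort_keys=True)


def from_json(text):
    """Decode JSON text back into the dictionary form."""
    return json.loads(text)


def lossy_export(mol):
    """A deliberately lossy step: keeps element symbols, drops everything else."""
    return {"atoms": [{"element": a["element"]} for a in mol["atoms"]]}


def is_lossless(mol, encode, decode):
    """True if decoding the encoded molecule reproduces the original exactly."""
    return decode(encode(mol)) == mol


# Heavy atoms of formaldehyde (C=O) with illustrative coordinates.
molecule = {
    "atoms": [
        {"element": "C", "x": 0.0, "y": 0.0, "charge": 0},
        {"element": "O", "x": 1.2, "y": 0.0, "charge": 0},
    ],
    "bonds": [{"start": 1, "end": 2, "order": 2}],
}

print(is_lossless(molecule, to_json, from_json))
print(is_lossless(molecule, lossy_export, lambda exported: exported))
```

A check of this kind can be run at every import, export and conversion step, so that a lossy link in the chain is caught before its output is copied further.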
Most contemporary chemistry structure and data manipulation tools, and the best practices that go along with them, are more comparable to the analog cassette tape metaphor than the (almost) lossless digital media ideal. There are exceptions, but it is typically left to the user to determine which tools are appropriate to which situation, and this requires a significant amount of field expertise.
All of the products, tools and services produced by Molecular Materials Informatics adhere to a puritanical definition of cheminformatics and chemical information, whether they be apps, webapps, desktop applications or cloud-hosted services. At the core of everything is the minimalistic chemical structure designed for lossless (yet extensible) translation and manipulation; these structures are wrapped together using a similarly well defined datasheet format, and manipulated by algorithms and user interfaces that take great pains to preserve the integrity of the data at all costs, even if it means that the software is a little bit harder to learn, or that seemingly obvious features are missing.
The study of cheminformatics is an esoteric discipline that nonetheless has much in common with popular software in general use by all kinds of chemists. The domain serves as the basis of powerful frameworks for managing and predicting scientific data, but care and attention are necessary to ensure that the fundamentals retain their integrity throughout the diverse and numerous operations that are possible.