Why does Chemistry Data need special care?

You may think that chemistry data can be treated with the same care as any other data. It is not that simple… Molecular structures, reactions and the rules that go with them require particular attention and specialised tools.

What makes Chemistry Data special?

Chemistry data aims to represent 3D molecular structures: the arrangement of the atoms, the types of atoms, the types of relationships (i.e. bonds) between the atoms, and their relative positions and orientations. Chemical data is usually represented by drawings or by strings made of letters (atomic symbols of the chemical elements), numbers and various other symbols such as Greek letters and brackets, each having a specific role. This is not your classic alphanumeric data or digital image data.

A little bit of history

Since the discovery of atoms and molecules, scientists have been using representations on paper (2D) to depict molecules that are 3D and consist of a variety of atoms and bonds. You can read more about the history of chemical representation in this article.

Fast forward to the 20th century: molecular representations have been digitised. Various digital formats exist, using strings or arrays of alphanumeric characters. Some are human readable, some are less human friendly, and some are proprietary and impossible to decode manually. While all the formats aim to describe a 3D molecular structure, some are richer than others and can encode more structural information.

Each of the digital formats for chemical entities has its pros and cons. It is important to assess your needs when choosing one. Here are a few commonly encountered examples (more formats are available):

.MOL (V3000) is one of the original and most complete chemistry formats, but MOL files have the drawback of being presented as a table (matrix) of atoms and bonds: they are difficult to read and need specific care to handle. They also form the basis of reaction files (.rxn) and of multi-entity files: .SDF for multiple structures and .RDF for multiple reactions.
SMILES represents a molecule as a string of chemical element symbols. SMILES strings have the advantage of being easy to handle and readable by humans, but they become tricky when dealing with cyclic and unsaturated compounds. Some information, such as stereochemistry, may be missing.
InChI: the International Chemical Identifier is a more recent addition and looks quite promising. It has the benefit of being a human-readable string and is a very rich chemical format.
Proprietary formats such as .cdx, .cdxml, .mrv, .skc and others are linked to specific drawing tools. While "translators" exist to convert from one file format to another, these formats are difficult to work with outside of the original drawing package and are not human readable.
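
To give a feel for why InChI is described as a rich format, here is a minimal sketch that splits the standard InChI of ethanol into its layers. The function name split_inchi_layers is hypothetical, and the parsing is deliberately naive: it assumes a formula layer is present and that each layer prefix letter appears only once, which is not true for every InChI. It is an illustration, not a substitute for a cheminformatics toolkit.

```python
# Standard InChI for ethanol (CH3-CH2-OH).
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def split_inchi_layers(inchi: str) -> dict:
    """Naively split an InChI string into its '/'-separated layers.

    Simplifying assumptions (for illustration only): the second segment
    is the chemical formula, and every later layer starts with a unique
    one-letter prefix such as 'c' (connections) or 'h' (hydrogens).
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        if layer:  # skip empty segments
            layers[layer[0]] = layer[1:]
    return layers


layers = split_inchi_layers(ETHANOL_INCHI)
for name, value in layers.items():
    print(f"{name:8s} {value}")
```

Each layer adds a level of detail: the formula, how the atoms are connected, and where the hydrogens sit. Further layers (not present for ethanol) carry information such as charge and stereochemistry.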

Why not use the chemical name of the molecules?

Molecules are named following a very specific (and also very long) set of rules developed by the International Union of Pure and Applied Chemistry (IUPAC). The full IUPAC name, in particular for organic compounds, makes it possible to draw the full detailed structure of the molecule. But generating these names is complicated, and they can get quite long and convoluted. It is likely that you will use a shortened name rather than the IUPAC name: who says 2-acetyloxybenzoic acid when they talk about aspirin? Or you would use the traditional scientific name: acetylsalicylic acid…

I should also mention the CAS Registry Number, an identifier assigned when a molecule is added to the CAS registry. While these numbers uniquely identify molecules, you must look up the molecule attached to a number: 58-08-2 is the CAS Registry Number for caffeine, but this is not obvious from the number alone.
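
The only thing you can check from the number itself is whether it is well formed: the last digit of a CAS Registry Number is a check digit computed from the preceding digits. Here is a minimal sketch of that check (the function name cas_check_digit_ok is my own choice for illustration). It can catch a typing error, but it still tells you nothing about the structure.

```python
def cas_check_digit_ok(cas: str) -> bool:
    """Return True if a CAS Registry Number has a valid check digit.

    Format: 2 to 7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
    The check digit is the sum of the other digits, each multiplied by
    its position counted from the right (1, 2, 3, ...), modulo 10.
    """
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    first, second, check = parts
    if not (2 <= len(first) <= 7 and len(second) == 2 and len(check) == 1):
        return False
    body = first + second
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == int(check)


print(cas_check_digit_ok("58-08-2"))  # caffeine: 8*1 + 0*2 + 8*3 + 5*4 = 52 -> True
print(cas_check_digit_ok("58-08-3"))  # wrong check digit -> False
```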

So we can name the molecules and/or save them with structural information. What is the challenge, then?

Chemical representation

Representing and naming the molecules is only the beginning of the story when working with molecules. Whether they are final active compounds or intermediates, there are rules for working with them in the lab, and also for working with their digital representations.

Drawing a 2D representation of a 3D structure is a challenge in itself: the same molecule can be drawn in different orientations, for example. Some functional groups can also be represented in different ways. A nitro group (-NO₂), for instance, can have several representations depending on how you draw the bonds between the nitrogen and the two oxygen atoms, all of them being correct due to the nature of this particular group. And this is just ONE example. Things get even more complicated when chemists use abbreviated groups: these need to be clearly defined so that everybody uses the same abbreviation for the same moiety, which is not always the case!

Chemical searching

Following on from this representation challenge, the next big challenge is searching for molecules. When searching for an exact word (text string), matching the letters in the word is sufficient. The search can be extended by applying an ordering such as alphabetical order, or even by using a synonym search for particular words or strings. This kind of letter-by-letter matching does not work well for chemical structures.
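
To illustrate why plain text matching falls short, here is a small sketch using SMILES strings. The three strings below are all valid ways of writing ethanol, and the two helper functions (naive_exact_match and naive_contains, names chosen for this example) do nothing more than ordinary string comparison.

```python
# Three valid SMILES strings that all describe ethanol (CH3-CH2-OH).
ETHANOL_SMILES = ["CCO", "OCC", "C(C)O"]


def naive_exact_match(query: str, candidate: str) -> bool:
    """Exact 'search' by plain string equality."""
    return query == candidate


def naive_contains(query: str, candidate: str) -> bool:
    """Substructure 'search' by plain substring matching."""
    return query in candidate


# Missed hits: only the first spelling of ethanol is found.
matches = [naive_exact_match("CCO", s) for s in ETHANOL_SMILES]
print(matches)  # [True, False, False]

# False hit: looking for carbon "C" in sodium chloride matches the
# "C" of the chlorine symbol "Cl", although there is no carbon at all.
print(naive_contains("C", "[Na+].[Cl-]"))  # True
```

Text matching both misses molecules that are there and finds molecules that are not. This is why chemical searching relies on tools that interpret the string as a structure before comparing anything.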

When searching for a molecule, here are a few high-level things to consider. Do you want to find a molecule that:

is the exact molecule, i.e. has the same arrangement of all the atoms and the same 3D orientation (stereochemistry)?
has the right order of the atoms, regardless of the stereochemistry (ignore stereochemistry)?
contains your search molecule, whether or not there are more features to it (sub-structure search)?
has a similar shape and/or electronic cloud (similarity search)? You then need to choose which similarity measure to use for this, e.g. Tanimoto.

From these examples, and there are many more cases to consider, you can see that dealing with chemical structures and chemical structure searching is much more complex than dealing with just text strings or numbers.
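
As an illustration of the similarity search mentioned above, here is a minimal sketch of the Tanimoto coefficient. It compares two sets of features: the number of shared features divided by the total number of distinct features. In practice the features come from molecular fingerprints generated by a cheminformatics toolkit. The feature sets below are hand-made toy examples, not real fingerprints, and returning 0.0 for two empty sets is a convention I assumed for this sketch.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: |A intersect B| / |A union B|."""
    if not a and not b:
        return 0.0  # assumed convention for two empty feature sets
    return len(a & b) / len(a | b)


# Toy, hand-written feature sets (NOT real fingerprints).
ASPIRIN = {"aromatic_ring", "carboxylic_acid", "ester"}
SALICYLIC_ACID = {"aromatic_ring", "carboxylic_acid", "hydroxyl"}

print(tanimoto(ASPIRIN, SALICYLIC_ACID))  # 2 shared / 4 distinct = 0.5
print(tanimoto(ASPIRIN, ASPIRIN))         # identical sets = 1.0
```

The score runs from 0 (nothing in common) to 1 (identical feature sets). Deciding which features to use and what score counts as "similar" is part of the care that chemical searching requires.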

Conclusion

With these considerations in mind (and believe me, there are many other considerations depending on the molecules and what you are doing with them), you can appreciate that dealing with molecular structural data requires special care.

To the challenges I have just described, you need to add the ones about chemical reactions. All the metadata and attached data need rigorous care too. Chemical data is one (complex) part of the data you manage when you think about FAIR data for your overall data! Taking good care of your chemistry data will enable you to leverage it for innovation, possibly using chemistry-specialised AI tools to further your research.

You also need to consider the people working with this data. Change Management with Chemists is crucial! Understanding their needs and the way they work, and engaging with them, should be treated as a top priority.

Fortunately, there are software tools and specialists available to help. I am a chemist and data steward consultant, and I have been working in the area of chemistry data management for over 20 years. If you want to discuss your chemistry data management challenges, please get in touch. Drop me a message!